Migrating 4.5 billion records to clickhouse, with a subsequent daily data volume on the order of 10 million
- Related Document Address
- flume Reference Address
- waterdrop Reference Address
- clickhouse Reference Address
- kafka Reference Address
- Environment: the logs live on one server and the clickhouse cluster runs on other servers.
- Ideas
- When migrating the historical data earlier, I had not considered this incremental part, so I only started thinking it through and preparing after the full historical data had been migrated.
- Idea 1: flume reads the log file and forwards it in flume's avro mode (avro is used here because the log server has no hadoop environment, and also to keep the log server stable). The receiving agent drops the collected logs onto hdfs with rules that split them by day, one data file per day. Then waterdrop is called with a dynamic date parameter from a linux crontab task that runs once a day, passing yesterday's date each time and pulling yesterday's data from HDFS into clickhouse (a minimal sketch of this daily job is shown after this list).
The whole scheme is workable and preliminary testing was done, but it does not match our scenario: the data on my side is fairly real-time and does not allow such a long one-day interval.
- Idea 2: drop flume and use waterdrop to read the log directly, clean it, and write straight into clickhouse. For processing logs with waterdrop, see my earlier blog posts.
This scheme is wrong because I had misunderstood waterdrop: it does not support real-time processing, so it is not feasible for continuously growing log files, even though I had already used it here to clean the logs and extract the data I need.
- Idea 3: flume reads the log, cleans the data, and writes it to kafka. clickhouse already provides support for interacting with kafka; see its external table engines. Following clickhouse's recommendation, a Kafka engine table is connected to kafka and a materialized view is built on top of it for real-time consumption.
This scheme is feasible. Many problems came up while using clickhouse's Kafka external table engine; they were finally resolved by going through the official issues and the maintainers' replies.
- Idea 4: flume reads the data and drops it onto kafka, then waterdrop periodically consumes kafka, with offsets recorded and managed automatically by kafka.
I did not test or validate this, but someone in the waterdrop user group has done it.
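For reference, a minimal sketch of the daily job from Idea 1, as a wrapper script plus a crontab entry; the waterdrop install path, the config file name and the -i variable flag are assumptions from my setup, not something to copy verbatim:
#!/bin/bash
# daily_import.sh -- pull yesterday's HDFS partition into clickhouse via waterdrop
# yesterday's date, matching the %Y-%m-%d folder layout flume writes to on HDFS
DT=$(date -d "yesterday" +%Y-%m-%d)
# hand the date to waterdrop as a variable (batch_hdfs_to_ch.conf is a hypothetical config name)
/opt/waterdrop/bin/start-waterdrop.sh --master yarn --deploy-mode client \
  --config /opt/waterdrop/config/batch_hdfs_to_ch.conf -i dt="$DT"

# crontab entry: run once a day at 01:00
# 0 1 * * * /home/scripts/daily_import.sh >> /home/logs/daily_import.log 2>&1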
- Implementation
- Points to note about flume's avro mode.
- The flume sender's sink needs to be filled in with the receiver's ip and port. In my case the sender is 104 and the receiver is 118. Related configuration:
# Define the names of the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Configure the source component
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/logs/gps.log
# Copy the data stream to multiple channels
#agent1.sources.source1.selector.type = replicating
# Configure the sink component
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = 192.168.108.118
agent1.sinks.sink1.port = 4141
agent1.sinks.sink1.request-timeout = 500000
# Configure the channel component
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 100000
agent1.channels.channel1.transactionCapacity = 10000
agent1.channels.channel1.keep.alive = 80
# Bind the channel to the source and sink
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.interceptors = filter1 search-replace1
agent1.sources.source1.interceptors.filter1.type = REGEX_FILTER
agent1.sources.source1.interceptors.filter1.regex = (The data received is:)
agent1.sources.source1.interceptors.filter1.excludeEvents = false
#agent1.sources.source1.interceptors.filter2.type = REGEX_FILTER
#agent1.sources.source1.interceptors.filter2.regex = (Received data is:)
#agent1.sources.source1.interceptors.filter2.excludeEvents = false
#agent1.sources.source1.interceptors = search-replace1
agent1.sources.source1.interceptors.search-replace1.type = search_replace
agent1.sources.source1.interceptors.search-replace1.searchPattern = [^{]*(?=\\{)
agent1.sources.source1.interceptors.search-replace1.replaceString =
- Here the source is the path of the log file being read.
- The rule for the channel configuration is capacity > transactionCapacity > batchSize.
- The receiver is configured as follows (start-up commands for both agents are sketched after these notes):
# Define the names of the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Configure the source component
agent1.sources.source1.type = avro
# The hostname or IP address to listen on
agent1.sources.source1.bind = 192.168.108.118
agent1.sources.source1.port = 4141
# Configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://cluster1/user/oracle/%Y-%m-%d
# Fixed prefix for new files flume creates under the HDFS folder
agent1.sinks.sink1.hdfs.filePrefix = access_log
# Maximum number of files allowed open; beyond this, the earliest open file is closed
#agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
# Number of events per batch operation when writing to HDFS
agent1.sinks.sink1.hdfs.batchSize = 5000
# File format, currently supported: SequenceFile, DataStream, CompressedStream.
# 1. DataStream does not compress files and does not need hdfs.codeC
# 2. CompressedStream requires the hdfs.codeC parameter
agent1.sinks.sink1.hdfs.fileType = DataStream
# File write format
agent1.sinks.sink1.hdfs.writeFormat = Text
# Whether to use the local timestamp when replacing escape sequences
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
# Suffix for new files flume creates under the HDFS folder
#agent1.sinks.sink1.hdfs.fileSuffix = .txt
# Roll to a new file once the current file reaches this size (0 = do not split by size)
agent1.sinks.sink1.hdfs.rollSize = 0
# Roll to a new file after this many seconds (0 = do not split by time)
agent1.sinks.sink1.hdfs.rollInterval = 0
# Roll to a new file after this many events (0 = do not split by event count)
agent1.sinks.sink1.hdfs.rollCount = 0
# Milliseconds allowed for HDFS operations such as open, write, flush, close.
# Increase this if many HDFS operations time out.
agent1.sinks.sink1.hdfs.callTimeout = 480000
# Number of IO threads per HDFS sink for HDFS operations (e.g. open, write)
agent1.sinks.sink1.hdfs.threadsPoolSize = 10
# Number of times the HDFS sink tries to rename a file after issuing a close
agent1.sinks.sink1.hdfs.closeTries = 0
# Interval between attempts to close files, in seconds
agent1.sinks.sink1.hdfs.retryInterval = 60
# Configure the channel component
agent1.channels.channel1.type = memory
agent1.channels.channel1.transactionCapacity = 10000
agent1.channels.channel1.keep.alive = 80
agent1.channels.channel1.capacity = 100000
# Bind the channel to the source and sink
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
- The receiver's source binds to 118 here because it needs to expose its own ip and port so that the sender can connect and transmit. It is recommended to fill in the machine's real ip rather than localhost or 127.0.0.1.
- When reading the log file on the sender, note that the F in tail -F is uppercase; be aware of the difference between tail -F and tail -f in linux.
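With both configuration files in place, the two agents are started roughly like this; the conf directory and file names are assumptions, and the receiver on 118 should be up before the sender:
# on 118 (receiver, avro source -> hdfs sink); receiver.conf is the second file above
flume-ng agent --conf /opt/flume/conf --conf-file /opt/flume/conf/receiver.conf \
  --name agent1 -Dflume.root.logger=INFO,console
# on the log server (sender, exec source -> avro sink); sender.conf is the first file above
flume-ng agent --conf /opt/flume/conf --conf-file /opt/flume/conf/sender.conf \
  --name agent1 -Dflume.root.logger=INFO,console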
- kafka-related issues and cluster building.
- The kafka cluster here was already set up by colleagues and can be used directly. The flume configuration for writing to it is as follows (only the sink1 part is shown):
# Configure the sink component
agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
#agent1.sinks.sink1.kafka.bootstrap.servers = master:9092,slave1:9092,slave2:9092
agent1.sinks.sink1.kafka.bootstrap.servers = 192.168.108.111:9092,192.168.108.112:9092,192.168.108.113:9092,192.168.108.118:9092,192.168.108.119:9092
#agent1.sinks.sink1.brokerList = master:9092,slave1:9092,slave2:9092
agent1.sinks.sink1.topic = s
agent1.sinks.sink1.kafka.flumeBatchSize = 20
agent1.sinks.sink1.kafka.producer.acks = 1
agent1.sinks.sink1.kafka.producer.linger.ms = 1
#agent1.sinks.sink1.kafka.compression.type = snappy
- For kafka-related commands and principles, refer to the documentation address above; a couple of commands I use for quick checks are sketched below.
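As a minimal example, assuming a Kafka 2.x installation (older versions create topics via --zookeeper instead of --bootstrap-server) and using the topic name from the clickhouse DDL below; partition and replica counts are placeholders:
# create the topic that flume writes to and clickhouse reads from
kafka-topics.sh --bootstrap-server 192.168.108.118:9092 --create \
  --topic wl_vehicle_data_clean --partitions 3 --replication-factor 2
# peek at the cleaned messages to confirm flume is actually producing
kafka-console-consumer.sh --bootstrap-server 192.168.108.118:9092 \
  --topic wl_vehicle_data_clean --from-beginning --max-messages 10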
- clickhouse-related notes
- The Kafka external table engine:
CREATE TABLE kafka_structure_test
(
    encrypt String,
    date DateTime,
    lon String,
    lat String,
    vec1 String,
    vec2 String,
    vec3 String,
    direction String,
    altitude String,
    state String,
    alarm String,
    vehicleNo String,
    vehicleColor String,
    id String,
    createBy String,
    createDt DateTime
)
ENGINE = Kafka
SETTINGS kafka_broker_list = '192.168.108.118:9092,192.168.108.119:9092',
         kafka_topic_list = 'wl_vehicle_data_clean',
         kafka_group_name = 'wl_vehicle_data_up',
         kafka_format = 'JSONEachRow',
         kafka_row_delimiter = '\n',
         kafka_num_consumers = 1,
         kafka_max_block_size = 500;
- Documentation for the Kafka engine: Document Address
- Building the materialized view:
CREATE MATERIALIZED VIEW consumer TO t_plt_vehicle_location_test AS
SELECT id, encrypt, date AS up_date, lon, createBy AS create_by, createDt AS create_dt,
       lat, vec1, vec2, vec3, direction, altitude, state, alarm,
       vehicleNo AS vehicleno, vehicleColor AS vehiclecolor
FROM kafka_structure_test;
- Creating a test table in clickhouse (a quick sanity check is sketched after the DDL):
CREATE TABLE t_plt_vehicle_location_test
(
    id String default 'MSG0',
    encrypt String default '0',
    up_date DateTime default '1970-01-01 00:00:01',
    lon String default -1,
    create_by String default 'UP_EXG_MSG_REAL_LOCATION',
    create_dt DateTime default now(),
    lat String default -1,
    vec1 String default -1,
    vec2 String default -1,
    vec3 String default -1,
    direction String default -1,
    altitude String default -1,
    state String default -1,
    alarm String default -1,
    vehicleno String default '-1',
    vehiclecolor String default '-1',
    alarm_code String default '-1'
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(up_date)
ORDER BY (vehicleno, up_date)
SETTINGS index_granularity = 8192
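Once the Kafka engine table, the materialized view and this target table are all in place, a quick sanity check from the shell shows whether rows are arriving; the host is one of the clickhouse nodes and clickhouse-client is assumed to be on the path:
# row count and latest ingestion time in the target table
clickhouse-client -h 192.168.108.118 --query \
  "SELECT count(), max(create_dt) FROM t_plt_vehicle_location_test"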
- When using clickhouse's Kafka engine we hit a problem: after the Kafka engine connected to kafka, it would drop offline at intervals, the connection was interrupted and it could no longer consume. After going through the related issues on github, this turned out to be a clickhouse bug, fixed in ClickHouse version 19.13.2.19. Related Links
- Follow-up questions can also be checked there. Troubleshoot by checking clickhouse's error log, or by looking at the consumer group's offsets on the kafka side to see whether the Kafka engine is still connected to kafka and consuming.
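Concretely, the two checks look roughly like this; the log path is the default clickhouse-server location and the group name is the kafka_group_name from the engine DDL:
# look for kafka-related errors in the clickhouse server log
grep -i kafka /var/log/clickhouse-server/clickhouse-server.err.log | tail -n 50
# check whether the engine's consumer group is still committing offsets and how far it lags
kafka-consumer-groups.sh --bootstrap-server 192.168.108.118:9092 \
  --describe --group wl_vehicle_data_up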
- Related Issues
- The clickhouse issue: clickhouse's support for kafka is not very stable, and even the upgraded version we are looking at here had a problem that was not fixed until the end of 2019.
- If you hit a small problem that nobody else seems to have, look at it carefully three times over: it is usually because you made a basic, low-level mistake that others did not, which is also why most people cannot answer when you ask them.
- Be careful, and careful again. When you encounter an error, do not trust your earlier logical judgment; re-analyse from scratch and verify step by step.