Follow-up to the 4.5 billion record data migration: loading tens of millions of rows per day into ClickHouse

Keywords: Database kafka Linux codec Hadoop

  • Related Document Address
  • Environment: the log files live on one server, and the ClickHouse cluster runs on other machines.
  • Thinking it through
    • When the historical data was migrated I had not taken this daily incremental data into account, so I only started thinking about it and preparing for it after the full migration was finished.
    • Idea 1: Flume reads the log file and forwards it using Flume's Avro mode (Avro is used here because the log server has no Hadoop environment, and it also keeps the log server stable). The receiving agent drops the logs onto HDFS, split by day according to the path rule, so each day produces one data directory. Then waterdrop is called with a dynamic date parameter: a Linux crontab task runs once a day, passes in yesterday's date, and pulls yesterday's data from HDFS into ClickHouse.

      The whole approach works and preliminary testing was done, but it does not fit my scenario: the data here is fairly real-time, and a full day's delay is too long. A rough sketch of such a daily job is shown below.
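
      For reference, this is roughly what the crontab-driven daily load would look like. It is only a sketch: the paths, the config file name, and the way the date is passed to waterdrop (the -i flag) are assumptions for illustration and should be checked against your waterdrop version.

        #!/bin/bash
        # load_yesterday.sh - pull yesterday's HDFS directory into ClickHouse via waterdrop
        # (paths and the -i variable flag are illustrative assumptions; check your waterdrop docs)
        yesterday=$(date -d "yesterday" +%Y-%m-%d)   # e.g. 2020-01-19

        /opt/waterdrop/bin/start-waterdrop.sh \
          --config /opt/waterdrop/config/hdfs_to_clickhouse.conf \
          -i dt="$yesterday"

        # crontab entry: run once a day at 01:00
        # 0 1 * * * /home/scripts/load_yesterday.sh >> /home/logs/waterdrop_daily.log 2>&1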

    • Idea 2: Drop Flume and have waterdrop read the log directly, clean it, and load it straight into ClickHouse. For how waterdrop processes logs, my earlier blog posts can be used as a reference.

      This approach is wrong because I had misunderstood waterdrop: waterdrop does not support real-time processing, so none of this is feasible for a continuously growing log file, even though I had already written the cleaning logic that extracts the data I need from the logs.

    • Idea 3: Flume reads the log file, cleans the data, and writes it to Kafka. ClickHouse already provides support for interacting with Kafka; see its external table engines. A ClickHouse Kafka-engine table is connected to Kafka, and then a materialized view is built on top of it for real-time consumption.

      This scheme is feasible. Many problems were encountered while using ClickHouse's Kafka external table engine, and they were finally resolved by reading through the official issues and the maintainers' replies.

    • Idea 4: Flume reads the data and writes it to Kafka, then waterdrop periodically consumes from Kafka; the offsets are recorded and managed automatically by Kafka.

      I did not test or validate this, but someone in the waterdrop user group has done it.

  • Implementation
    • Points to note about Flume's Avro source and sink.
      • The sender's sink must be filled in with the receiver's IP and port. In my setup the sender is the .104 machine and the receiver is .118. Related configuration (a small simulation of what the interceptors do follows the config block):

        # Define the names of the three components
        agent1.sources = source1
        agent1.sinks = sink1
        agent1.channels = channel1
        
        # Configure source components
        agent1.sources.source1.type = exec
        agent1.sources.source1.command = tail -F /home/logs/gps.log
        
        # Copy the data stream to multiple channels
        #agent1.sources.source1.selector.type = replicating
        
        # Configuring sink components
        agent1.sinks.sink1.type = avro
        agent1.sinks.sink1.hostname = 192.168.108.118
        agent1.sinks.sink1.port = 4141
        agent1.sinks.sink1.request-timeout = 500000
        
        # Configure channel Component
        agent1.channels.channel1.type = memory
        agent1.channels.channel1.capacity = 100000
        agent1.channels.channel1.transactionCapacity = 10000
        agent1.channels.channel1.keep.alive = 80
        
        # Bind channel to source and sink
        agent1.sources.source1.channels = channel1
        agent1.sinks.sink1.channel = channel1
        
        agent1.sources.source1.interceptors = filter1 search-replace1
        agent1.sources.source1.interceptors.filter1.type = REGEX_FILTER
        agent1.sources.source1.interceptors.filter1.regex = (The data received is:)
        agent1.sources.source1.interceptors.filter1.excludeEvents = false
        
        #agent1.sources.source1.interceptors.filter2.type = REGEX_FILTER
        #agent1.sources.source1.interceptors.filter2.regex = (Received data is:)
        #agent1.sources.source1.interceptors.filter2.excludeEvents = false
        
        #agent1.sources.source1.interceptors = search-replace1
        agent1.sources.source1.interceptors.search-replace1.type = search_replace
        agent1.sources.source1.interceptors.search-replace1.searchPattern = [^{]*(?=\\{)
        agent1.sources.source1.interceptors.search-replace1.replaceString =
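
      • To make the two interceptors above concrete: the REGEX_FILTER keeps only lines containing the marker text, and the search_replace interceptor strips everything before the first '{', leaving just the JSON body. A rough shell simulation (the sample line is made up for illustration):

        # A line as it might appear in gps.log (content is made up for illustration)
        sample='The data received is: {"vehicleNo":"ABC123","lon":"116.38","lat":"39.90"}'

        # grep plays the role of REGEX_FILTER, sed the role of search_replace
        echo "$sample" | grep 'The data received is:' | sed 's/^[^{]*//'
        # expected output: {"vehicleNo":"ABC123","lon":"116.38","lat":"39.90"}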
        
        
      • Here the source is the path of the log file being read.

      • The rule for channel configuration is capacity > transactionCapacity > batchSize

      • The recipient is configured as follows

        #Define the names of the three components
        agent1.sources = source1
        agent1.sinks = sink1
        agent1.channels = channel1
        
        
        # Configure source components
        agent1.sources.source1.type = avro
        # The hostname or IP address to listen on
        agent1.sources.source1.bind = 192.168.108.118
        agent1.sources.source1.port = 4141
        
        
        # Configuring sink components
        agent1.sinks.sink1.type = hdfs
        agent1.sinks.sink1.hdfs.path = hdfs://cluster1/user/oracle/%Y-%m-%d
        #Flume creates a fixed prefix for new files under the HDFS folder
        agent1.sinks.sink1.hdfs.filePrefix = access_log
        #Maximum number of files allowed to open, if more than this number, the first open file will be closed
        #agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
        #Number of Events per batch operation when writing to HDFS
        agent1.sinks.sink1.hdfs.batchSize= 5000
        #File format, currently supported: SequenceFile, DataStream, CompressedStream.1. DataStream does not compress files and does not need to set hdfs.codeC 2. CompressedStream must set the hdfs.codeC parameter
        agent1.sinks.sink1.hdfs.fileType = DataStream
        #File Write Format
        agent1.sinks.sink1.hdfs.writeFormat = Text
        #Whether to use local timestamps when replacing escape sequences
        agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
        #Suffix for Flume to create new files under HDFS folder
        #agent1.sinks.sink1.hdfs.fileSuffix = .txt
        #When the current file is written to that size, it triggers scrolling to create a new file (0 means that the file is not split according to its size)
        agent1.sinks.sink1.hdfs.rollSize = 0
        #When the current file is written to that value, it triggers scrolling to create a new file (0 means that the file is not divided by time), in seconds
        agent1.sinks.sink1.hdfs.rollInterval = 0
        #When the current file has been written to an Event of this number, it triggers a scrolling creation of a new file (0 means that the file is not split based on the number of Events)
        agent1.sinks.sink1.hdfs.rollCount = 0
        #Timeout in milliseconds allowed for HDFS operations such as open, write, flush, and close. Increase this if many HDFS operations time out.
        agent1.sinks.sink1.hdfs.callTimeout = 480000
        #Number of IO operation threads per HDFS sink for HDFS (e.g., open, write)
        agent1.sinks.sink1.hdfs.threadsPoolSize = 10
        #Number of times HDFS sink must attempt to rename a file after issuing a shutdown command
        agent1.sinks.sink1.hdfs.closeTries = 0
        #Interval in seconds between successive attempts to close a file
        agent1.sinks.sink1.hdfs.retryInterval = 60
        
        
        # Configure channel Component
        agent1.channels.channel1.type = memory
        agent1.channels.channel1.transactionCapacity = 10000
        agent1.channels.channel1.keep.alive = 80
        agent1.channels.channel1.capacity = 100000
        
        
        
        # Bind channel to source and sink
        agent1.sources.source1.channels = channel1
        agent1.sinks.sink1.channel = channel1
        
        
      • The receiver's source binds to .118 here because it has to expose its own IP and port so that .114 can connect and transmit. It is recommended to fill in the machine's real IP rather than localhost or 127.0.0.1.

      • On the sender, when tailing the log file note that the F in tail -F is uppercase; mind the difference between -F and -f in Linux (-F keeps following the file by name across rotation and truncation).
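
      • For reference, this is how the two agents are typically started and how the day-partitioned output can be checked on HDFS. The config file names are placeholders for illustration; start the receiver on .118 first, then the sender on the log server.

        # start the receiving agent (on 192.168.108.118)
        flume-ng agent --conf ./conf --conf-file gps-receiver.conf --name agent1 -Dflume.root.logger=INFO,console

        # start the sending agent (on the log server)
        flume-ng agent --conf ./conf --conf-file gps-sender.conf --name agent1 -Dflume.root.logger=INFO,console

        # verify that today's directory is being written (path pattern from hdfs.path above)
        hdfs dfs -ls hdfs://cluster1/user/oracle/$(date +%Y-%m-%d)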

    • Kafka-related issues and cluster setup.
      • The Kafka cluster had already been set up by colleagues here and could be used directly. The Flume configuration that writes into it is as follows (only the sink1 part is shown):

        # Configuring sink components
        agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
        #agent1.sinks.sink1.kafka.bootstrap.servers = master:9092,slave1:9092,slave2:9092
        agent1.sinks.sink1.kafka.bootstrap.servers = 192.168.108.111:9092,192.168.108.112:9092,192.168.108.113:9092,192.168.108.118:9092,192.168.108.119:9092
        # agent1.sinks.sink1.brokerList = master:9092,slave1:9092,slave2:9092
        agent1.sinks.sink1.topic = s
        agent1.sinks.sink1.kafka.flumeBatchSize = 20
        agent1.sinks.sink1.kafka.producer.acks = 1
        agent1.sinks.sink1.kafka.producer.linger.ms = 1
        #agent1.sinks.sink1.kafka.compression.type = snappy
        
        
      • For Kafka-related commands and principles, refer to the documentation address above; a couple of quick sanity checks are sketched below.
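
      • To confirm that Flume is actually delivering events, the standard Kafka console tools can be used. The topic name here is taken from the ClickHouse table definition below; on older Kafka versions these tools take --zookeeper instead of --bootstrap-server.

        # list the topics on the cluster
        kafka-topics.sh --bootstrap-server 192.168.108.118:9092 --list

        # tail a few messages from the cleaned-data topic
        kafka-console-consumer.sh --bootstrap-server 192.168.108.118:9092 \
          --topic wl_vehicle_data_clean --from-beginning --max-messages 10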

    • ClickHouse-related work
      • The Kafka external table engine

        CREATE TABLE kafka_structure_test
        (
            encrypt String,
            date DateTime,
            lon String,
            lat String,
            vec1 String,
            vec2 String,
            vec3 String,
            direction String,
            altitude String,
            state String,
            alarm String,
            vehicleNo String,
            vehicleColor String,
            id String,
            createBy String,
            createDt DateTime
        ) ENGINE = Kafka
        SETTINGS
            kafka_broker_list = '192.168.108.118:9092,192.168.108.119:9092',
            kafka_topic_list = 'wl_vehicle_data_clean',
            kafka_group_name = 'wl_vehicle_data_up',
            kafka_format = 'JSONEachRow',
            kafka_row_delimiter = '\n',
            kafka_num_consumers = 1,
            kafka_max_block_size = 500;
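
      • A quick way to confirm the engine table can reach Kafka and parse the JSON is a one-off direct SELECT. Note this is only a sketch: a direct SELECT consumes the rows it returns, and some ClickHouse versions restrict direct SELECTs from Kafka tables, so routine reading is left to the materialized view below.

        # one-off peek at the Kafka engine table (consumes the rows it returns)
        clickhouse-client --query "SELECT vehicleNo, date, lon, lat FROM kafka_structure_test LIMIT 5"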
        
        
      • Document reference for kafka engine Document Address

      • Building the materialized view

        CREATE MATERIALIZED VIEW consumer TO t_plt_vehicle_location_test AS
        SELECT
            id,
            encrypt,
            date AS up_date,
            lon,
            createBy AS create_by,
            createDt AS create_dt,
            lat,
            vec1,
            vec2,
            vec3,
            direction,
            altitude,
            state,
            alarm,
            vehicleNo AS vehicleno,
            vehicleColor AS vehiclecolor
        FROM kafka_structure_test;
        
      • Creating the ClickHouse test (target) table

        CREATE TABLE t_plt_vehicle_location_test
        (
            id String DEFAULT 'MSG0',
            encrypt String DEFAULT '0',
            up_date DateTime DEFAULT '1970-01-01 00:00:01',
            lon String DEFAULT '-1',
            create_by String DEFAULT 'UP_EXG_MSG_REAL_LOCATION',
            create_dt DateTime DEFAULT now(),
            lat String DEFAULT '-1',
            vec1 String DEFAULT '-1',
            vec2 String DEFAULT '-1',
            vec3 String DEFAULT '-1',
            direction String DEFAULT '-1',
            altitude String DEFAULT '-1',
            state String DEFAULT '-1',
            alarm String DEFAULT '-1',
            vehicleno String DEFAULT '-1',
            vehiclecolor String DEFAULT '-1',
            alarm_code String DEFAULT '-1'
        )
        ENGINE = MergeTree()
        PARTITION BY toYYYYMM(up_date)
        ORDER BY (vehicleno, up_date)
        SETTINGS index_granularity = 8192
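
      • A simple end-to-end smoke test is to push one hand-written JSON row into the topic and check that it shows up in the target table. The field values below are made up for illustration; older Kafka console producers use --broker-list, newer ones --bootstrap-server.

        # produce one test row into the topic consumed by the Kafka engine table
        echo '{"encrypt":"0","date":"2020-01-20 10:00:00","lon":"116.38","lat":"39.90","vec1":"60","vec2":"60","vec3":"0","direction":"90","altitude":"50","state":"0","alarm":"0","vehicleNo":"TEST001","vehicleColor":"1","id":"MSG0","createBy":"TEST","createDt":"2020-01-20 10:00:01"}' \
          | kafka-console-producer.sh --broker-list 192.168.108.118:9092 --topic wl_vehicle_data_clean

        # the materialized view should land it in the MergeTree table shortly afterwards
        clickhouse-client --query "SELECT * FROM t_plt_vehicle_location_test WHERE vehicleno = 'TEST001'"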
        
      • When using ClickHouse's Kafka engine we ran into a problem: after the Kafka-engine table connected to Kafka it would drop offline at intervals, the connection was interrupted and nothing could be consumed. After going through the related issues on GitHub, this turned out to be a ClickHouse bug; it was fixed in ClickHouse version 19.13.2.19. Related Links

      • Follow-up issues can also be checked there. Look at ClickHouse's error log, or check the consumer offsets on the Kafka side to see whether the Kafka engine is still connected to Kafka and consuming; a few such checks are sketched below.
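
      • A minimal set of health checks, assuming a default package install of ClickHouse (for the log path) and the consumer group name from the Kafka engine table above:

        # ClickHouse server error log (default location for package installs)
        tail -n 200 /var/log/clickhouse-server/clickhouse-server.err.log

        # is the Kafka engine's consumer group still attached, and are its offsets advancing?
        kafka-consumer-groups.sh --bootstrap-server 192.168.108.118:9092 \
          --describe --group wl_vehicle_data_up

        # are rows still arriving in the target table?
        clickhouse-client --query "SELECT count(), max(create_dt) FROM t_plt_vehicle_location_test"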

  • Related Issues
    • On the ClickHouse side: ClickHouse's support for Kafka is not very stable, and the problem we upgraded for here was only fixed toward the end of 2019.
    • If you hit a small problem that nobody else seems to have, look at it carefully, three times over: it is most likely a basic, low-level mistake of your own that others simply did not make, which is also why most people you ask will not know the answer.
    • Be careful, be careful. When you hit an error, do not trust your earlier logical judgment; re-analyze from scratch and verify step by step.

Posted by gszauer on Mon, 20 Jan 2020 18:52:36 -0800