Follow-up to the 4.5 billion record data migration: loading tens of millions of rows per day into ClickHouse

Keywords: Database kafka Linux codec Hadoop

  • Related Document Address
  • Environment: the log files live on one server, and the ClickHouse cluster runs on other machines.
  • Thinking it through
    • When the historical data was migrated I had not taken this daily incremental data into account, so I only started thinking about it and preparing for it after the full migration was finished.
    • Idea 1: Flume reads the log file and forwards it using Flume's Avro mode (Avro is used here because the log server has no Hadoop environment, and it also keeps the log server stable). The receiving agent drops the logs onto HDFS, split by day according to the path rule, so each day produces one data directory. Then waterdrop is called with a dynamic date parameter: a Linux crontab task runs once a day, passes in yesterday's date, and pulls yesterday's data from HDFS into ClickHouse.

      The whole approach works and preliminary testing was done, but it does not fit my scenario: the data here is fairly real-time, and a full day's delay is too long. A rough sketch of such a daily job is shown below.
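
      For reference, this is roughly what the crontab-driven daily load would look like. It is only a sketch: the paths, the config file name, and the way the date is passed to waterdrop (the -i flag) are assumptions for illustration and should be checked against your waterdrop version.

        #!/bin/bash
        # load_yesterday.sh - pull yesterday's HDFS directory into ClickHouse via waterdrop
        # (paths and the -i variable flag are illustrative assumptions; check your waterdrop docs)
        yesterday=$(date -d "yesterday" +%Y-%m-%d)   # e.g. 2020-01-19

        /opt/waterdrop/bin/start-waterdrop.sh \
          --config /opt/waterdrop/config/hdfs_to_clickhouse.conf \
          -i dt="$yesterday"

        # crontab entry: run once a day at 01:00
        # 0 1 * * * /home/scripts/load_yesterday.sh >> /home/logs/waterdrop_daily.log 2>&1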

    • Idea 2: Drop Flume and have waterdrop read the log directly, clean it, and load it straight into ClickHouse. For how waterdrop processes logs, my earlier blog posts can be used as a reference.

      This approach is wrong because I had misunderstood waterdrop: waterdrop does not support real-time processing, so none of this is feasible for a continuously growing log file, even though I had already written the cleaning logic that extracts the data I need from the logs.

    • Idea 3: Flume reads the log file, cleans the data, and writes it to Kafka. ClickHouse already provides support for interacting with Kafka; see its external table engines. A ClickHouse Kafka-engine table is connected to Kafka, and then a materialized view is built on top of it for real-time consumption.

      This scheme is feasible. Many problems were encountered while using ClickHouse's Kafka external table engine, and they were finally resolved by reading through the official issues and the maintainers' replies.

    • Idea 4: Flume reads the data and writes it to Kafka, then waterdrop periodically consumes from Kafka; the offsets are recorded and managed automatically by Kafka.

      I did not test or validate this, but someone in the waterdrop user group has done it.

  • Implementation
    • Points to note about Flume's Avro source and sink.
      • The sender's sink must be filled in with the receiver's IP and port. In my setup the sender is the .104 machine and the receiver is .118. Related configuration (a small simulation of what the interceptors do follows the config block):

        # Define the names of the three components
        agent1.sources = source1
        agent1.sinks = sink1
        agent1.channels = channel1
        
        # Configure source components
        agent1.sources.source1.type = exec
        agent1.sources.source1.command = tail -F /home/logs/gps.log
        
        # Copy the data stream to multiple channels
        #agent1.sources.source1.selector.type = replicating
        
        # Configuring sink components
        agent1.sinks.sink1.type = avro
        agent1.sinks.sink1.hostname = 192.168.108.118
        agent1.sinks.sink1.port = 4141
        agent1.sinks.sink1.request-timeout = 500000
        
        # Configure channel Component
        agent1.channels.channel1.type = memory
        agent1.channels.channel1.capacity = 100000
        agent1.channels.channel1.transactionCapacity = 10000
        agent1.channels.channel1.keep.alive = 80
        
        # Bind channel to source and sink
        agent1.sources.source1.channels = channel1
        agent1.sinks.sink1.channel = channel1
        
        agent1.sources.source1.interceptors = filter1 search-replace1
        agent1.sources.source1.interceptors.filter1.type = REGEX_FILTER
        agent1.sources.source1.interceptors.filter1.regex = (The data received is:)
        agent1.sources.source1.interceptors.filter1.excludeEvents = false
        
        #agent1.sources.source1.interceptors.filter2.type = REGEX_FILTER
        #agent1.sources.source1.interceptors.filter2.regex = (Received data is:)
        #agent1.sources.source1.interceptors.filter2.excludeEvents = false
        
        #agent1.sources.source1.interceptors = search-replace1
        agent1.sources.source1.interceptors.search-replace1.type = search_replace
        agent1.sources.source1.interceptors.search-replace1.searchPattern = [^{]*(?=\\{)
        agent1.sources.source1.interceptors.search-replace1.replaceString =
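
      • To make the two interceptors above concrete: the REGEX_FILTER keeps only lines containing the marker text, and the search_replace interceptor strips everything before the first '{', leaving just the JSON body. A rough shell simulation (the sample line is made up for illustration):

        # A line as it might appear in gps.log (content is made up for illustration)
        sample='The data received is: {"vehicleNo":"ABC123","lon":"116.38","lat":"39.90"}'

        # grep plays the role of REGEX_FILTER, sed the role of search_replace
        echo "$sample" | grep 'The data received is:' | sed 's/^[^{]*//'
        # expected output: {"vehicleNo":"ABC123","lon":"116.38","lat":"39.90"}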
        
        
      • Here the source is the path of the log file being read.

      • The rule for channel configuration is capacity > transactionCapacity > batchSize

      • The recipient is configured as follows

        #Define the names of the three components
        agent1.sources = source1
        agent1.sinks = sink1
        agent1.channels = channel1
        
        
        # Configure source components
        agent1.sources.source1.type = avro
        # The hostname or IP address to listen on
        agent1.sources.source1.bind = 192.168.108.118
        agent1.sources.source1.port = 4141
        
        
        # Configuring sink components
        agent1.sinks.sink1.type = hdfs
        agent1.sinks.sink1.hdfs.path = hdfs://cluster1/user/oracle/%Y-%m-%d
        #Flume creates a fixed prefix for new files under the HDFS folder
        agent1.sinks.sink1.hdfs.filePrefix = access_log
        #Maximum number of files allowed to open, if more than this number, the first open file will be closed
        #agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
        #Number of Events per batch operation when writing to HDFS
        agent1.sinks.sink1.hdfs.batchSize= 5000
        #File format, currently supported: SequenceFile, DataStream, CompressedStream.1. DataStream does not compress files and does not need to set hdfs.codeC 2. CompressedStream must set the hdfs.codeC parameter
        agent1.sinks.sink1.hdfs.fileType = DataStream
        #File Write Format
        agent1.sinks.sink1.hdfs.writeFormat = Text
        #Whether to use local timestamps when replacing escape sequences
        agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
        #Suffix for Flume to create new files under HDFS folder
        #agent1.sinks.sink1.hdfs.fileSuffix = .txt
        #When the current file is written to that size, it triggers scrolling to create a new file (0 means that the file is not split according to its size)
        agent1.sinks.sink1.hdfs.rollSize = 0
        #When the current file is written to that value, it triggers scrolling to create a new file (0 means that the file is not divided by time), in seconds
        agent1.sinks.sink1.hdfs.rollInterval = 0
        #When the current file has been written to an Event of this number, it triggers a scrolling creation of a new file (0 means that the file is not split based on the number of Events)
        agent1.sinks.sink1.hdfs.rollCount = 0
        #Timeout in milliseconds allowed for HDFS operations such as open, write, flush, and close. Increase this if many HDFS operations time out.
        agent1.sinks.sink1.hdfs.callTimeout = 480000
        #Number of IO operation threads per HDFS sink for HDFS (e.g., open, write)
        agent1.sinks.sink1.hdfs.threadsPoolSize = 10
        #Number of times HDFS sink must attempt to rename a file after issuing a shutdown command
        agent1.sinks.sink1.hdfs.closeTries = 0
        #Interval in seconds between successive attempts to close a file
        agent1.sinks.sink1.hdfs.retryInterval = 60
        
        
        # Configure channel Component
        agent1.channels.channel1.type = memory
        agent1.channels.channel1.transactionCapacity = 10000
        agent1.channels.channel1.keep.alive = 80
        agent1.channels.channel1.capacity = 100000
        
        
        
        # Bind channel to source and sink
        agent1.sources.source1.channels = channel1
        agent1.sinks.sink1.channel = channel1
        
        
      • The receiver's source binds to .118 here because it has to expose its own IP and port so that .114 can connect and transmit. It is recommended to fill in the machine's real IP rather than localhost or 127.0.0.1.

      • On the sender, when tailing the log file note that the F in tail -F is uppercase; mind the difference between -F and -f in Linux (-F keeps following the file by name across rotation and truncation).
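
      • For reference, this is how the two agents are typically started and how the day-partitioned output can be checked on HDFS. The config file names are placeholders for illustration; start the receiver on .118 first, then the sender on the log server.

        # start the receiving agent (on 192.168.108.118)
        flume-ng agent --conf ./conf --conf-file gps-receiver.conf --name agent1 -Dflume.root.logger=INFO,console

        # start the sending agent (on the log server)
        flume-ng agent --conf ./conf --conf-file gps-sender.conf --name agent1 -Dflume.root.logger=INFO,console

        # verify that today's directory is being written (path pattern from hdfs.path above)
        hdfs dfs -ls hdfs://cluster1/user/oracle/$(date +%Y-%m-%d)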

    • Kafka-related issues and cluster setup.
      • The Kafka cluster had already been set up by colleagues here and could be used directly. The Flume configuration that writes into it is as follows (only the sink1 part is shown):

        # Configuring sink components
        agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
        #agent1.sinks.sink1.kafka.bootstrap.servers = master:9092,slave1:9092,slave2:9092
        agent1.sinks.sink1.kafka.bootstrap.servers = 192.168.108.111:9092,192.168.108.112:9092,192.168.108.113:9092,192.168.108.118:9092,192.168.108.119:9092
        # agent1.sinks.sink1.brokerList = master:9092,slave1:9092,slave2:9092
        agent1.sinks.sink1.topic = s
        agent1.sinks.sink1.kafka.flumeBatchSize = 20
        agent1.sinks.sink1.kafka.producer.acks = 1
        agent1.sinks.sink1.kafka.producer.linger.ms = 1
        #agent1.sinks.sink1.kafka.compression.type = snappy
        
        
      • For Kafka-related commands and principles, refer to the documentation address above; a couple of quick sanity checks are sketched below.
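
      • To confirm that Flume is actually delivering events, the standard Kafka console tools can be used. The topic name here is taken from the ClickHouse table definition below; on older Kafka versions these tools take --zookeeper instead of --bootstrap-server.

        # list the topics on the cluster
        kafka-topics.sh --bootstrap-server 192.168.108.118:9092 --list

        # tail a few messages from the cleaned-data topic
        kafka-console-consumer.sh --bootstrap-server 192.168.108.118:9092 \
          --topic wl_vehicle_data_clean --from-beginning --max-messages 10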

    • ClickHouse-related work
      • The Kafka external table engine

        CREATE TABLE kafka_structure_test
        (
            encrypt String,
            date DateTime,
            lon String,
            lat String,
            vec1 String,
            vec2 String,
            vec3 String,
            direction String,
            altitude String,
            state String,
            alarm String,
            vehicleNo String,
            vehicleColor String,
            id String,
            createBy String,
            createDt DateTime
        ) ENGINE = Kafka
        SETTINGS
            kafka_broker_list = '192.168.108.118:9092,192.168.108.119:9092',
            kafka_topic_list = 'wl_vehicle_data_clean',
            kafka_group_name = 'wl_vehicle_data_up',
            kafka_format = 'JSONEachRow',
            kafka_row_delimiter = '\n',
            kafka_num_consumers = 1,
            kafka_max_block_size = 500;
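
      • A quick way to confirm the engine table can reach Kafka and parse the JSON is a one-off direct SELECT. Note this is only a sketch: a direct SELECT consumes the rows it returns, and some ClickHouse versions restrict direct SELECTs from Kafka tables, so routine reading is left to the materialized view below.

        # one-off peek at the Kafka engine table (consumes the rows it returns)
        clickhouse-client --query "SELECT vehicleNo, date, lon, lat FROM kafka_structure_test LIMIT 5"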
        
        
      • Document reference for kafka engine Document Address

      • Building the materialized view

        CREATE MATERIALIZED VIEW consumer TO t_plt_vehicle_location_test AS
        SELECT
            id,
            encrypt,
            date AS up_date,
            lon,
            createBy AS create_by,
            createDt AS create_dt,
            lat,
            vec1,
            vec2,
            vec3,
            direction,
            altitude,
            state,
            alarm,
            vehicleNo AS vehicleno,
            vehicleColor AS vehiclecolor
        FROM kafka_structure_test;
        
      • Creating the ClickHouse test (target) table

        CREATE TABLE t_plt_vehicle_location_test
        (
            id String DEFAULT 'MSG0',
            encrypt String DEFAULT '0',
            up_date DateTime DEFAULT '1970-01-01 00:00:01',
            lon String DEFAULT '-1',
            create_by String DEFAULT 'UP_EXG_MSG_REAL_LOCATION',
            create_dt DateTime DEFAULT now(),
            lat String DEFAULT '-1',
            vec1 String DEFAULT '-1',
            vec2 String DEFAULT '-1',
            vec3 String DEFAULT '-1',
            direction String DEFAULT '-1',
            altitude String DEFAULT '-1',
            state String DEFAULT '-1',
            alarm String DEFAULT '-1',
            vehicleno String DEFAULT '-1',
            vehiclecolor String DEFAULT '-1',
            alarm_code String DEFAULT '-1'
        )
        ENGINE = MergeTree()
        PARTITION BY toYYYYMM(up_date)
        ORDER BY (vehicleno, up_date)
        SETTINGS index_granularity = 8192
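
      • A simple end-to-end smoke test is to push one hand-written JSON row into the topic and check that it shows up in the target table. The field values below are made up for illustration; older Kafka console producers use --broker-list, newer ones --bootstrap-server.

        # produce one test row into the topic consumed by the Kafka engine table
        echo '{"encrypt":"0","date":"2020-01-20 10:00:00","lon":"116.38","lat":"39.90","vec1":"60","vec2":"60","vec3":"0","direction":"90","altitude":"50","state":"0","alarm":"0","vehicleNo":"TEST001","vehicleColor":"1","id":"MSG0","createBy":"TEST","createDt":"2020-01-20 10:00:01"}' \
          | kafka-console-producer.sh --broker-list 192.168.108.118:9092 --topic wl_vehicle_data_clean

        # the materialized view should land it in the MergeTree table shortly afterwards
        clickhouse-client --query "SELECT * FROM t_plt_vehicle_location_test WHERE vehicleno = 'TEST001'"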
        
      • When using ClickHouse's Kafka engine we ran into a problem: after the Kafka-engine table connected to Kafka it would drop offline at intervals, the connection was interrupted and nothing could be consumed. After going through the related issues on GitHub, this turned out to be a ClickHouse bug; it was fixed in ClickHouse version 19.13.2.19. Related Links

      • Follow-up issues can also be checked there. Look at ClickHouse's error log, or check the consumer offsets on the Kafka side to see whether the Kafka engine is still connected to Kafka and consuming; a few such checks are sketched below.
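
      • A minimal set of health checks, assuming a default package install of ClickHouse (for the log path) and the consumer group name from the Kafka engine table above:

        # ClickHouse server error log (default location for package installs)
        tail -n 200 /var/log/clickhouse-server/clickhouse-server.err.log

        # is the Kafka engine's consumer group still attached, and are its offsets advancing?
        kafka-consumer-groups.sh --bootstrap-server 192.168.108.118:9092 \
          --describe --group wl_vehicle_data_up

        # are rows still arriving in the target table?
        clickhouse-client --query "SELECT count(), max(create_dt) FROM t_plt_vehicle_location_test"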

  • Related Issues
    • On the ClickHouse side: ClickHouse's support for Kafka is not very stable, and the problem we upgraded for here was only fixed toward the end of 2019.
    • If you hit a small problem that nobody else seems to have, look at it carefully, three times over: it is most likely a basic, low-level mistake of your own that others simply did not make, which is also why most people you ask will not know the answer.
    • Be careful, be careful. When you hit an error, do not trust your earlier logical judgment; re-analyze from scratch and verify step by step.

Posted by gszauer on Mon, 20 Jan 2020 18:52:36 -0800