Filtering and processing log files with waterdrop

Keywords: Database Java Spark SQL JSON

Using waterdrop to filter and process log files and store the resulting data in ClickHouse

  • Installing waterdrop

    • Download the installation package of waterdrop using wget
      wget xxxxx
      
    • Extract it to the directory you need
      unzip XXX (package path) -d XXX (destination path)
      
      If unzip reports an error such as "command not found", install the unzip command yourself first.
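      For illustration, a minimal sketch assuming version 1.5.1 downloaded from the project's GitHub releases (the version and URL are assumptions; substitute the ones you actually need):

      # hypothetical release version; substitute the one you actually need
      wget https://github.com/InterestingLab/waterdrop/releases/download/v1.5.1/waterdrop-1.5.1.zip
      unzip waterdrop-1.5.1.zip -d /extends/soft/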
    • Set the dependency environment (Java, Spark, Hadoop) in the package's config directory:
      vim ./waterdrop-env.sh
      
      
      #!/usr/bin/env bash
      # Home directory of spark distribution.
      
      SPARK_HOME=/extends/soft/spark-2.4.4
      
      JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
      
      HADOOP_HOME=/extends/soft/hadoop-2.7.4
      
      
    • Before writing your own config, copy one of the bundled examples in the config directory and modify it, as sketched below.
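      A sketch of that step, assuming the bundled example is named spark.batch.conf.template (this file name is hypothetical; list the config directory to find the real examples):

      # hypothetical file names; check the config directory for the bundled examples
      cp config/spark.batch.conf.template config/gps.batch.conf
      vim config/gps.batch.conf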
  • Set up the config file to process the data

    • What I'm doing here is reading the log file, filtering out the valid data, and storing it in ClickHouse.

    • The config file is posted below:

    ######
    ###### This config file is a demonstration of batch processing in waterdrop config
    ######
    
    spark {
      # You can set spark configuration here
      # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
      spark.app.name = "Waterdrop"
      spark.executor.instances = 2
      spark.executor.cores = 1
      spark.executor.memory = "1g"
    }
    
    input {
      # This is an example input plugin, **only for testing and demonstrating the input plugin feature**
      # fake {
      #   result_table_name = "my_dataset"
      # }

      file {
        path = "file:///home/logs/gps.log"
        result_table_name = "gps"
        format = "text"
      }

      # You can also use other input plugins, such as hdfs
      # hdfs {
      #   result_table_name = "accesslog"
      #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
      #   format = "json"
      # }

      # If you would like to get more information about how to configure waterdrop and see the full list of input plugins,
      # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    
    filter {
      # split data by a specific delimiter
      # split {
      #   fields = ["msg", "name"]
      #   delimiter = " "
      #   result_table_name = "accesslog"
      # }

      # keep only the lines that carry the JSON payload
      sql {
        sql = "select * from gps  where raw_message like '%The received data is:%' "
      }

      # cut each line at the delimiter: field1 = log prefix, field2 = JSON payload
      split {
        source_field = "raw_message"
        delimiter = "The received data is:"
        fields = ["field1", "field2"]
      }

      # parse the JSON payload into columns and register the result as a new table
      json {
        source_field = "field2"
        result_table_name = "gps_test"
      }

      # cast the fields to strings with concat and keep only rows with a full-length (19-char) date
      sql {
        sql = "select concat('',encrypt) as encrypt,`date` as up_date,concat('',lon) as lon,concat('',lat) as lat,concat('',vec1) as vec1,concat('',vec2) as vec2,concat('',vec3) as vec3,concat('',direction) as direction,concat('',altitude) as altitude,concat('',state) as state,concat('',alarm) as alarm,concat('',vehicleno) as vehicleno,concat('',vehiclecolor) as vehiclecolor,id,createBy as create_by,createDt as create_dt  from gps_test  where LENGTH(date) = 19"
      }

      # you can also use other filter plugins, such as sql
      # sql {
      #   sql = "select * from accesslog where request_time > 1000"
      # }

      # If you would like to get more information about how to configure waterdrop and see the full list of filter plugins,
      # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    
    output {
      # choose the stdout output plugin to print data to the console
      # stdout {
      # }

      clickhouse {
        host = "hadoop-4:8123"
        clickhouse.socket_timeout = 50000
        database = "wlpt_01"
        table = "t_plt_vehicle_location_test"
        fields = ["id","encrypt","up_date","lon","create_by","create_dt","lat","vec1","vec2","vec3","direction","altitude","state","alarm","vehicleno","vehiclecolor"]
        username = "default"
        password = "********"
        bulk_size = 5
        retry = 3
      }

      # you can also use other output plugins, such as hdfs
      # hdfs {
      #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
      #   save_mode = "append"
      # }

      # If you would like to get more information about how to configure waterdrop and see the full list of output plugins,
      # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
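
    The target ClickHouse table has to exist before the job runs. A hypothetical DDL matching the fields list above; the column types (String, matching the concat() casts) and the MergeTree engine are assumptions, not taken from the original setup:

    -- hypothetical schema: the concat()-cast fields arrive as strings
    CREATE TABLE wlpt_01.t_plt_vehicle_location_test
    (
        id           String,
        encrypt      String,
        up_date      String,
        lon          String,
        create_by    String,
        create_dt    String,
        lat          String,
        vec1         String,
        vec2         String,
        vec3         String,
        direction    String,
        altitude     String,
        state        String,
        alarm        String,
        vehicleno    String,
        vehiclecolor String
    )
    ENGINE = MergeTree()
    ORDER BY id;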
    
    • Thinking it through
      • Log data format
      2020-01-03 13:36:23,967 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,992 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx:  4608 ; ########:  1111
      2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=52559, crcCode=14014, msgId=4608, msgSn=61937, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[0, 0, 1]]
      2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] receive positioning data
      2020-01-03 13:36:23,995 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b0000007306ce969412000000045701020f0000000000c9c2483132353538000000000000000000000000000212010000003d3530373334000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030313339393134363431383920d35d)
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx:  4608 ; ########:  1111
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=115, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=8403, msgId=4608, msgSn=114202260, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=89, cap=89), versionFlag=[1, 2, 15]]
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] receive positioning data
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) - (HEXDUMP: 5b0000005a0206ce969512000000045701020f0000000000c9c2483132353538000000000000000000000000000212020000002400030107e40d24100690c4100208318f00000000000168e300b500030000100300000000a78f5d)
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx:  4608 ; ########:  1111
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=42895, msgId=4608, msgSn=114202261, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[1, 2, 15]]
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] receive positioning data
      2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] The received data is:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"Shaanxi Province H12558","vehicleColor":2,"id":"MSG Shaanxi Province H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
      2020-01-03 13:36:24,006 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b00000073016a52f61200000004570100000000000000c9c2444235323136000000000000000000000000000212010000003d353039383900000000000000000000000000000000000000000000000000000000000000000000000000000000000000003031343239383831343430300b845d)
      
      • The line ending in "createDt":"2020-01-03 13:36:23"} carries the JSON I need: extract that JSON, parse it, and insert each field into the corresponding column of the database table.
    • My first thought was regex filtering: keep only the lines whose payload is the JSON shown above, then parse the JSON and load it into the database.
    • Using select * from gps where raw_message rlike '{.*}$' for regex filtering reported an error, so I asked the author, Ricky Hoo, who suggested using SQL like instead.
    • Since the regex could not be made to work, I switched to like; its usage is the same as like in MySQL. raw_message holds the raw content of each input row:
       sql {
         sql = "select * from gps  where raw_message like '%The received data is:%' "
       }
      
    • split handles the cutting. The delimiter is The received data is:, so each matching line is divided in two: field1 keeps the log prefix (timestamp, thread, class name) and field2 keeps the JSON payload ending in ..."vehicleColor":2,"id":"MSG Shaanxi Province H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}. Together with raw_message the row then has three parts, shown separated by "|" in the stdout output; see the worked example after the snippet below:
       split {
         source_field = "raw_message"
         delimiter = "The received data is:"
         fields = ["field1", "field2"]
       }
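
      For the sample log line above, the split yields roughly the following two fields (values taken from the log excerpt earlier; whitespace trimmed for readability):

       field1 = 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114]
       field2 = {"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"Shaanxi Province H12558","vehicleColor":2,"id":"MSG Shaanxi Province H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}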
      
    • With that, the JSON has been split out; the split step already defined the output fields to match the result (see the config file above). Next, parse the JSON with the json plugin:
       json {
         source_field = "field2"
         result_table_name = "gps_test"
       }
      
    • The JSON now parses successfully, but a new table is needed to hold the result of this step, so result_table_name above registers the output under the new name gps_test. The final sql step in the config then casts each field to a string with concat and keeps only rows whose date has the full 19-character yyyy-MM-dd HH:mm:ss form.
  • Run the job and check the stored data

    • Finally, fill in the output block, run the job, and query the ClickHouse table to confirm the write succeeded; a run sketch follows below.
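
      A minimal sketch of a run and a check, assuming the config was saved as config/gps.batch.conf (that file name, the Spark master, and deploy mode are assumptions that depend on your environment; waterdrop v1.x ships the start-waterdrop.sh launcher):

      # submit the batch job to Spark in local mode
      ./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/gps.batch.conf

      # then count the rows that landed in ClickHouse
      clickhouse-client --host hadoop-4 --query "select count(*) from wlpt_01.t_plt_vehicle_location_test"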

When you first start debugging, use the stdout output so that the result of each step is printed to the console; this makes debugging much more convenient.

output {
  # the stdout output plugin prints each stage's result to the console
  stdout {
  }
}

Waterdrop documentation: https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base

Thanks to the author of waterdrop, Ricky Hoo, for his guidance and help during my use of the tool.

Posted by PhantomCube on Mon, 13 Jan 2020 01:04:18 -0800