Using waterdrop to filter and process log files and store the data
Installing waterdrop
- Download the installation package of waterdrop using wget
wget xxxxx
- Extract to the directory you need
If unzip reports an error (command not found), install the unzip package first.
unzip XXX (package location) -d XXX (extraction location)
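Putting the download and extraction together, a minimal sketch (the release URL below is a placeholder, as in the original; substitute the real link from the waterdrop release page):

# placeholder URL -- substitute the actual release package link
wget https://example.com/waterdrop-<version>.zip
# extract into the install directory of your choice
unzip waterdrop-<version>.zip -d /extends/soft/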
- Set the dependency environment (Java, Spark, Hadoop) in the config directory:
vim ./config/waterdrop-env.sh

#!/usr/bin/env bash
# Home directory of spark distribution.
SPARK_HOME=/extends/soft/spark-2.4.4
JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
HADOOP_HOME=/extends/soft/hadoop-2.7.4
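A quick sanity check that those three paths point at working installations before running anything:

# each command should print a version banner, not "No such file or directory"
/usr/java/jdk1.8.0_202-amd64/bin/java -version
/extends/soft/spark-2.4.4/bin/spark-submit --version
/extends/soft/hadoop-2.7.4/bin/hadoop version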
- Copy one of the example configs shipped under config/ and make your changes to the copy, for example:
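A sketch of that step (the exact name of the shipped example config may differ between versions):

# work on a copy of a shipped example rather than editing it in place
cp ./config/application.conf.template ./config/batch.conf
vim ./config/batch.conf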
Setting up the config file to process the data
- What I do here is read the log file, filter out the valid data, and store it in ClickHouse.
- The config file I ended up with is below:
######
###### This config file is a demonstration of batch processing in waterdrop config
######

spark {
    # You can set spark configuration here
    # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
    spark.app.name = "Waterdrop"
    spark.executor.instances = 2
    spark.executor.cores = 1
    spark.executor.memory = "1g"
}

input {
    # This is an example input plugin, **only for testing and demonstrating input plugins**
    # fake {
    #     result_table_name = "my_dataset"
    # }

    file {
        path = "file:///home/logs/gps.log"
        result_table_name = "gps"
        format = "text"
    }

    # You can also use other input plugins, such as hdfs
    # hdfs {
    #     result_table_name = "accesslog"
    #     path = "hdfs://hadoop-cluster-01/nginx/accesslog"
    #     format = "json"
    # }

    # If you would like to get more information about how to configure waterdrop and see the full list of input plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

filter {
    # split data by a specific delimiter
    # split {
    #     fields = ["msg", "name"]
    #     delimiter = " "
    #     result_table_name = "accesslog"
    # }

    sql {
        sql = "select * from gps where raw_message like '%The received data is:%' "
    }

    split {
        source_field = "raw_message"
        delimiter = "The received data is:"
        fields = ["field1", "field2"]
    }

    json {
        source_field = "field2"
        result_table_name = "gps_test"
    }

    sql {
        sql = "select concat('',encrypt) as encrypt,`date` as up_date,concat('',lon) as lon,concat('',lat) as lat,concat('',vec1) as vec1,concat('',vec2) as vec2,concat('',vec3) as vec3,concat('',direction) as direction,concat('',altitude) as altitude,concat('',state) as state,concat('',alarm) as alarm,concat('',vehicleno) as vehicleno,concat('',vehiclecolor) as vehiclecolor,id,createBy as create_by,createDt as create_dt from gps_test where LENGTH(date) = 19"
    }

    # you can also use other filter plugins, such as sql
    # sql {
    #     sql = "select * from accesslog where request_time > 1000"
    # }

    # If you would like to get more information about how to configure waterdrop and see the full list of filter plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

output {
    # choose the stdout output plugin to output data to the console
    # stdout {
    # }

    clickhouse {
        host = "hadoop-4:8123"
        clickhouse.socket_timeout = 50000
        database = "wlpt_01"
        table = "t_plt_vehicle_location_test"
        fields = ["id","encrypt","up_date","lon","create_by","create_dt","lat","vec1","vec2","vec3","direction","altitude","state","alarm","vehicleno","vehiclecolor"]
        username = "default"
        password = "********"
        bulk_size = 5
        retry = 3
    }

    # you can also use other output plugins, such as hdfs
    # hdfs {
    #     path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
    #     save_mode = "append"
    # }

    # If you would like to get more information about how to configure waterdrop and see the full list of output plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
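The post does not show the DDL of the target table, but the fields list above implies its columns. A minimal sketch of what it could look like, assuming String columns (the sql filter casts every value to a string with concat('', ...)) and a MergeTree engine; the real types, engine, and sort key may well differ:

-- hypothetical DDL sketch: only the column names come from the config above
CREATE TABLE wlpt_01.t_plt_vehicle_location_test
(
    id           String,
    encrypt      String,
    up_date      String,
    lon          String,
    lat          String,
    vec1         String,
    vec2         String,
    vec3         String,
    direction    String,
    altitude     String,
    state        String,
    alarm        String,
    vehicleno    String,
    vehiclecolor String,
    create_by    String,
    create_dt    String
)
ENGINE = MergeTree()
ORDER BY (vehicleno, up_date);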
Approach
- Log data format:
2020-01-03 13:36:23,967 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
2020-01-03 13:36:23,992 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=52559, crcCode=14014, msgId=4608, msgSn=61937, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[0, 0, 1]]
2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] receive positioning data
2020-01-03 13:36:23,995 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b0000007306ce969412000000045701020f0000000000c9c2483132353538000000000000000000000000000212010000003d3530373334000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030313339393134363431383920d35d)
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=115, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=8403, msgId=4608, msgSn=114202260, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=89, cap=89), versionFlag=[1, 2, 15]]
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] receive positioning data
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) - (HEXDUMP: 5b0000005a0206ce969512000000045701020f0000000000c9c2483132353538000000000000000000000000000212020000002400030107e40d24100690c4100208318f00000000000168e300b500030000100300000000a78f5d)
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=42895, msgId=4608, msgSn=114202261, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[1, 2, 15]]
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] receive positioning data
2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] The received data is:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"Shaanxi Province H12558","vehicleColor":2,"id":"MSG Shaanxi Province H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
2020-01-03 13:36:24,006 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b00000073016a52f61200000004570100000000000000c9c2444235323136000000000000000000000000000212010000003d353039383900000000000000000000000000000000000000000000000000000000000000000000000000000000000000003031343239383831343430300b845d)
- The JSON payload at the end of the "The received data is:" line is what I need:
{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"Shaanxi Province H12558","vehicleColor":2,"id":"MSG Shaanxi Province H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
- I need to extract this JSON, parse it, and map each of its fields to the corresponding column of the database table.
- My first idea was regex filtering: match only the rows whose tail is this JSON, then parse the JSON and load it into the warehouse.
- When I tried regex filtering with select * from gps where raw_message rlike '{.*}$', an error was reported. I asked the author, Ricky Hoo, and he suggested that I use SQL like instead.
- The regex could not be made to work and never matched properly, so I switched to like. Note that its usage is the same as like in MySQL: match raw_message against every row that contains "The received data is:":
sql {
    sql = "select * from gps where raw_message like '%The received data is:%' "
}
- The matched rows still carry the whole log line, ending with the JSON: ..."state":4099,"alarm":0,"vehicleNo":"Shaanxi Province H12558","vehicleColor":2,"id":"MSG Shaanxi Province H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
- split is used next for segmentation, with "The received data is:" as the delimiter. Each matched row is thereby divided into the text before the delimiter (field1) and the JSON payload after it (field2):
split {
    source_field = "raw_message"
    delimiter = "The received data is:"
    fields = ["field1", "field2"]
}
- With that, the JSON has been split out. In the split step above the output fields were already named to match the results (see the full config file above). Next, parse the JSON using the json plugin:
json {
    source_field = "field2"
    result_table_name = "gps_test"
}
- Here the JSON is parsed successfully, but a new table is needed to hold the result of the previous steps, so I renamed the result table via result_table_name = "gps_test" in the json plugin above; a quick way to preview that table is sketched below.
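A temporary sql filter (a sketch; remove it once debugging is done) that selects a few rows from gps_test, to be printed with the stdout output described in the next section:

# temporary debugging filter: preview a few parsed rows from gps_test
sql {
    sql = "select * from gps_test limit 10"
}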
Running the job and checking the stored data
- Finally, fill in the output section for the write, run the job, and query the ClickHouse table to confirm the write succeeded.
At the beginning of debugging, use the stdout output instead, so that the result of each step shows up in the log; this makes debugging much more convenient:
output {
    # choose the stdout output plugin to output data to the console
    stdout {
    }
}
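With the config saved, submit the job and then check the table. A sketch, assuming the v1.x launcher script and a config saved as ./config/batch.conf (both names may differ in your setup):

# run the job locally while debugging
./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/batch.conf

# verify the rows arrived; 8123 is the ClickHouse HTTP port taken from the config above
# (add -u default:<password> if the server requires credentials)
curl 'http://hadoop-4:8123/' --data-binary 'select count(*) from wlpt_01.t_plt_vehicle_location_test'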
Reference: the waterdrop documentation, https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
Thanks to Ricky Hoo, the author of waterdrop, for his guidance and help throughout.