Real-time upload of directory files to HDFS (case)
Case requirements: use Flume to monitor an entire directory and upload new files to HDFS
Implementation steps:
1. Create the configuration file "flume-dir-hdfs.conf"
touch flume-dir-hdfs.conf
vim flume-dir-hdfs.conf

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix of the uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round down the timestamp (create folders by time)
a3.sinks.k3.hdfs.round = true
# How much time before a new folder is created
a3.sinks.k3.hdfs.roundValue = 1
# Unit for the rounding value
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to batch before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# Set the file type (can support compression)
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 600 seconds
a3.sinks.k3.hdfs.rollInterval = 600
# Roll the file when it reaches about 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
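Before starting the agent, it helps to make sure the monitored directory actually exists, since the Spooling Directory Source typically refuses to start against a missing path. A minimal sketch, assuming the spoolDir path from the configuration above:

# Create the directory that spoolDir points to (path taken from the config above)
mkdir -p /opt/module/flume/upload
# Optionally confirm HDFS is reachable before starting the agent
hdfs dfs -ls /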
2. Start the folder-monitoring command
bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

When using the Spooling Directory Source, note that:
(1) Do not create or continuously modify files in the monitored directory
(2) Uploaded files are renamed with the .COMPLETED suffix
(3) The monitored folder is scanned for file changes every 500 milliseconds
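For troubleshooting, it can be convenient to run the agent in the foreground with console logging. This is only a sketch of the same start command with a standard log4j override added; the extra flag is optional and not part of the original steps:

bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf -Dflume.root.logger=INFO,console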
3. Add files to the upload folder (this must be the same directory configured as spoolDir above)
cd /root/app/flume
mkdir upload
touch hao.txt
touch hao.tmp
touch hao.log
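The touch commands above create empty files, which get picked up and renamed but contribute no events to HDFS. To see data flow end to end, it may be clearer to drop in files that contain a few lines; a small sketch (contents are illustrative):

echo "hello flume" > hao.txt
echo "this is a temp file" > hao.tmp
echo "hello hdfs" > hao.log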
4. View data on HDFS
"node01:50070"
5. Wait 1 s and query the upload folder again
cd /root/app/flume/upload
ll

The three files are listed again: hao.txt and hao.log now carry the .COMPLETED suffix, while hao.tmp keeps its original name because it matches the ignorePattern and is never uploaded.
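Given the fileSuffix and ignorePattern settings above, the listing is expected to look roughly like this (illustrative, based on the three files created earlier):

hao.log.COMPLETED
hao.tmp
hao.txt.COMPLETED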