Installation and Use of Log Collection Framework Flume

Keywords: Hadoop hive Apache Zookeeper

1. Introduction to Flume

1.1. Overview of Flume

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.
Flume can collect data from many kinds of sources, such as files and socket packets.
It can also write the collected data to many external storage systems, such as HDFS, HBase, Hive, and Kafka.
Common collection requirements can be met simply by configuring Flume.
Flume can also be customized and extended for special scenarios, so it is applicable to most everyday data collection scenarios.

1.2. Operating mechanism

1. The core role in a Flume distributed system is the agent. A Flume collection system is formed by connecting agents together.
2. Each agent is equivalent to a data courier and has three components (a minimal configuration illustrating them follows this list):
a) Source: the collection source, which connects to the data source and obtains data
b) Sink: the sink, which delivers data either to the next-level agent or to the final storage system
c) Channel: a data transfer channel inside the agent, used to move data from the source to the sink
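
For orientation, here is a minimal sketch of a single-agent configuration (the agent name a1 and port 44444 are arbitrary example values; the complete configurations actually used in this article appear in later sections):

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: a netcat source listening on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: a logger sink that writes events to the agent's log
a1.sinks.k1.type = logger

# Channel: a memory channel connecting the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1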

1.3. Structural diagram of Flume acquisition system

1.3.1. Simple structure

Data acquisition by a single agent

1.3.2. Complex structure

Serial Connection between Multi-level Agents

2. Install Flume

2.1. Unzip Flume compressed file to specified directory

[root@node02 software]# tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /opt/modules/

2.2. Rename the extracted directory

[root@node02 modules]# mv apache-flume-1.6.0-bin flume-1.6.0
[root@node02 modules]# ll
total 24
drwxr-xr-x.  9 matrix matrix 4096 Jan  7 13:44 elasticsearch-2.4.2
drwxr-xr-x.  7 root   root   4096 Jan 24 13:09 flume-1.6.0
drwxr-xr-x. 12 matrix matrix 4096 Jan 23 21:00 hadoop-2.5.1
drwxr-xr-x.  8 root   root   4096 Jan 23 18:43 hive-1.2.1
drwxr-xr-x.  3 matrix matrix 4096 Dec 19 16:01 journalnode
drwxr-xr-x. 12 matrix matrix 4096 Dec 17 21:20 zookeeper

2.3. Configuring Flume environment variables

[root@node02 ~]# ls -a
.   anaconda-ks.cfg  .bash_logout   .bashrc  .hivehistory  install.log.syslog      .mysql_history  .ssh     zookeeper.out
..  .bash_history    .bash_profile  .cshrc   install.log   jdk-7u79-linux-x64.rpm  .pki            .tcshrc
[root@node02 ~]# vi .bash_profile

export FLUME_HOME=/opt/modules/flume-1.6.0
export PATH=$PATH:$FLUME_HOME/bin

2.4. Make configuration effective

[root@node02 ~]# source .bash_profile
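
After sourcing the profile, the flume-ng script should be on the PATH. A quick sanity check (the exact output depends on the build) is:

[root@node02 ~]# flume-ng version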

2.5. Collect files to HDFS

Collection requirement: for example, a business system writes its logs with log4j, and the log file keeps growing.
The data appended to the log file needs to be collected to HDFS in real time.

Based on this requirement, first define the following three elements:
Collection source, which monitors file content updates: an exec source running 'tail -F file'
Sink target, the HDFS file system: an hdfs sink
Channel, the transfer channel between source and sink: either a file channel or a memory channel can be used

2.5.1. Configure Flume configuration file

[root@node02 flume-1.6.0]# vi conf/tail-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#The exec source runs a command and reads its output
# Describe/configure the source
a1.sources.r1.type = exec
#tail -F follows the file by name (survives rotation); tail -f follows the open file descriptor (inode)
a1.sources.r1.command = tail -F /home/hadoop/log/test.log
a1.sources.r1.channels = c1

# Describe the sink
#Sinking target
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
#Target directory; Flume replaces the escape sequences (%y-%m-%d/%H%M) automatically
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
#File name prefix
a1.sinks.k1.hdfs.filePrefix = events-

#Round down the timestamp so a new directory is used every 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

#Roll the current file after this interval (seconds)
a1.sinks.k1.hdfs.rollInterval = 3

#File size (bytes) that triggers a roll
a1.sinks.k1.hdfs.rollSize = 500

#Number of events written before rolling the file
a1.sinks.k1.hdfs.rollCount = 20

#Flush a batch to HDFS every 5 events
a1.sinks.k1.hdfs.batchSize = 5

#Format directories with local time
a1.sinks.k1.hdfs.useLocalTimeStamp = true

#Type of file generated; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
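
Note: the %y-%m-%d/%H%M escape sequences in hdfs.path are resolved from a timestamp in the event header. Setting useLocalTimeStamp = true makes the sink take that timestamp from the agent's local clock, so no timestamp interceptor is needed in this configuration.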

2.6. Write data to a specified file

[root@node02 flume-1.6.0]# mkdir -p /home/hadoop/log
[root@node02 flume-1.6.0]# touch /home/hadoop/log/test.log
[root@node02 ~]# while true
> do
> echo 11111111111 >> /home/hadoop/log/test.log
> sleep 0.6
> done

[root@node02 flume-1.6.0]# tail -f /home/hadoop/log/test.log

2.7. Start Flume Log Collection

Note: check whether Hadoop HDFS is running; if not, start it first.
[root@node02 flume-1.6.0]# ./bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
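
Here -c (--conf) points to the Flume configuration directory, -f (--conf-file) is the agent definition file written above, and -n (--name) must match the agent name used in that file (a1).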

2.8. View the directories and files created by Flume on HDFS

[root@node02 hadoop-2.5.1]# ./bin/hdfs dfs -ls -R /flume
drwxr-xr-x   - root supergroup          0 2017-01-24 13:36 /flume/events
drwxr-xr-x   - root supergroup          0 2017-01-24 13:36 /flume/events/17-01-24
drwxr-xr-x   - root supergroup          0 2017-01-24 13:38 /flume/events/17-01-24/1330
-rw-r--r--   3 root supergroup        140 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236169660
-rw-r--r--   3 root supergroup        140 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236169661
-rw-r--r--   3 root supergroup         70 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236169662
-rw-r--r--   3 root supergroup         77 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236189653
-rw-r--r--   3 root supergroup         77 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236195683
-rw-r--r--   3 root supergroup         77 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236201641
-rw-r--r--   3 root supergroup         77 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236207790
-rw-r--r--   3 root supergroup         84 2017-01-24 13:36 /flume/events/17-01-24/1330/events-.1485236213809
-rw-r--r--   3 root supergroup         70 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236219808
-rw-r--r--   3 root supergroup         84 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236225867
-rw-r--r--   3 root supergroup         77 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236231852
-rw-r--r--   3 root supergroup         70 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236238116
-rw-r--r--   3 root supergroup         84 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236244133
-rw-r--r--   3 root supergroup         63 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236250160
-rw-r--r--   3 root supergroup         56 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236254744
-rw-r--r--   3 root supergroup         42 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236260456
-rw-r--r--   3 root supergroup         35 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236264210
-rw-r--r--   3 root supergroup         35 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236267832
-rw-r--r--   3 root supergroup         49 2017-01-24 13:37 /flume/events/17-01-24/1330/events-.1485236271410
-rw-r--r--   3 root supergroup         84 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236275630
-rw-r--r--   3 root supergroup         77 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236281581
-rw-r--r--   3 root supergroup         77 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236287587
-rw-r--r--   3 root supergroup         84 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236293646
-rw-r--r--   3 root supergroup         70 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236299642
-rw-r--r--   3 root supergroup         49 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236305888
-rw-r--r--   3 root supergroup         70 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236311177
-rw-r--r--   3 root supergroup         35 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236315362
-rw-r--r--   3 root supergroup         77 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236320019
-rw-r--r--   3 root supergroup         77 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236324629
-rw-r--r--   3 root supergroup         42 2017-01-24 13:38 /flume/events/17-01-24/1330/events-.1485236330636
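
The directory contains many small files because the roll thresholds in tail-hdfs.conf were set deliberately low for this demonstration (rollInterval = 3 seconds, rollSize = 500 bytes, rollCount = 20 events). In a real deployment these values would be raised considerably so that each file written to HDFS is much larger.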

3. Connecting Multiple Flume Agents

On one node, collect data from the tail command and send it to an avro port (tail -> avro).
On another node, configure an avro source to receive the relayed data and send it to external storage (avro -> logger).

3.1. Install Flume on node03

3.2. Configure Flume on node02

Collect data from the tail command and send it to an avro port:
[root@node02 flume-1.6.0]# vi conf/tail-avro.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/log/test.log
a1.sources.r1.channels = c1

# Describe the sink
#The avro sink is a sender (an avro client); hostname and port point to the avro source on node03, not to the local machine
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.230.12
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3.2.1. Run Flume on node02 and send data to node03

[root@node02 flume-1.6.0]# ./bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1
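
Note: it is simplest to start the receiving agent on node03 (section 3.3.1) first. Until the avro source on node03 is listening on port 4141, the avro sink on node02 cannot deliver events; failed deliveries stay in the channel and the sink keeps retrying, logging connection errors in the meantime.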

3.3. Configure Flume on node03

Configure an avro source that receives the relayed data and sends it to external storage such as HDFS (here avro -> logger):

[root@node03 flume-1.6.0]# vi conf/avro-hdfs.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
#The avro source is the receiving service; it binds to a local address and port
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
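
Although the file is named avro-hdfs.conf, the sink above is a logger sink, so the received events are only printed to the console (useful for verifying the relay). To actually write them to HDFS, the logger sink could be replaced with an hdfs sink configured as in section 2.5.1, for example (a sketch reusing the same path settings):

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream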

3.3.1. Run Flume on node03 to receive data from node02

[root@node03 flume-1.6.0]# ./bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

You can see that port 4141 is being listened on:
[root@node03 ~]# netstat -nltp

3.4. Sending data
