Big Data Flume: Custom Components

Keywords: Scala Big Data Spark

1. Customize the Interceptor

1.1 case requirements

When Flume is used to collect a server's local logs, different types of logs need to be sent to different analysis systems according to the log type.

1.2 Requirement analysis: Interceptor and Multiplexing ChannelSelector case

In actual development, one server may produce many types of logs, and different types of logs may need to be sent to different analysis systems. This is where the Multiplexing structure in a Flume topology is used. Multiplexing works by sending events to different channels according to the value of a key in each event's header, so we need to customize an Interceptor that assigns different values to that key for different types of events.
In this case, we use port data to simulate logs and distinguish log types by whether they contain "bigdata". We customize an interceptor that checks whether the data contains "bigdata" and sends it to the corresponding analysis system (channel).

1.3 implementation steps

(1) Create a maven project and introduce the following dependencies.

<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>

(2) Define the TypeInterceptor class and implement the Interceptor interface.

package com.bigdata.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {
    //Declare a collection of events
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        //Initialize the collection that stores events
        addHeaderEvents = new ArrayList<>();
    }

    //Single event interception
    @Override
    public Event intercept(Event event) {
        //1. Get the header information of the event
        Map<String, String> headers = event.getHeaders();
        //2. Get the body of the event
        String body = new String(event.getBody());
        //3. Decide how to add header information according to whether the body contains "bigdata"
        if (body.contains("bigdata")) {
            //4. Add header information
            headers.put("type", "first");
        } else {
            //4. Add header information
            headers.put("type", "second");
        }
        return event;
    }

    //Batch event interception
    @Override
    public List<Event> intercept(List<Event> events) {
        //1. Clear the collection
        addHeaderEvents.clear();
        //2. Traverse the events
        for (Event event : events) {
            //3. Add header information to each event
            addHeaderEvents.add(intercept(event));
        }
        //4. Return the results
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
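
If you want to sanity-check the routing logic before deploying, a small throwaway driver like the one below (a hypothetical TypeInterceptorDemo class, not part of the original steps) runs two sample events through the interceptor and prints the resulting "type" header:

package com.bigdata.interceptor;

import org.apache.flume.Event;
import org.apache.flume.event.SimpleEvent;

public class TypeInterceptorDemo {
    public static void main(String[] args) {
        TypeInterceptor interceptor = new TypeInterceptor();
        interceptor.initialize();

        //An event whose body contains "bigdata" should get type=first
        Event first = new SimpleEvent();
        first.setBody("hello bigdata".getBytes());
        interceptor.intercept(first);
        System.out.println(first.getHeaders().get("type"));

        //Any other event should get type=second
        Event second = new SimpleEvent();
        second.setBody("hello flume".getBytes());
        interceptor.intercept(second);
        System.out.println(second.getHeaders().get("type"));

        interceptor.close();
    }
}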

(3) Edit the Flume configuration files.
Configure Flume1 on hadoop102 with one netcat source, two memory channels and two avro sinks, and configure the corresponding ChannelSelector and Interceptor.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.bigdata.interceptor.TypeInterceptor$Builder
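# Route events to c1 or c2 based on the "type" header set by the interceptor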
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.first = c1
a1.sources.r1.selector.mapping.second = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Configure an avro source and a logger sink for Flume2 on hadoop103.

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

Configure an avro source and a logger sink for Flume3 on hadoop104.

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

(4) Start the Flume processes on hadoop102, hadoop103 and hadoop104, paying attention to the start order: the avro sources on hadoop103 and hadoop104 must be up before the agent on hadoop102 starts its avro sinks.
(5) On hadoop102, use netcat to send data to localhost:44444, including some lines that contain "bigdata" and some that do not.
(6) Observe the logs printed on hadoop103 and hadoop104.

2. Customize the Source

2.1 introduction

Source is the component responsible for receiving data into the Flume Agent. The Source component can handle various types and formats of log data, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http and legacy. The officially provided source types cover many cases, but sometimes they cannot meet the needs of actual development, and we then need to customize sources according to actual requirements.
Flume also provides an official guide for custom sources:
https://flume.apache.org/FlumeDeveloperGuide.html#source
According to the official instructions, a custom MySource needs to extend the AbstractSource class and implement the Configurable and PollableSource interfaces, implementing the corresponding methods:
getBackOffSleepIncrement()   // backoff step
getMaxBackOffSleepInterval() // maximum backoff interval
configure(Context context)   // initialize the context (read the configuration file)
process()                    // get the data, package it into an Event and write it to the channel; this method is called in a loop
Usage scenario: reading data from MySQL or other file systems.
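
As a minimal sketch of that contract (an illustrative SkeletonSource class, assuming only the flume-ng-core dependency from section 1.3; the full working example follows in 2.4):

package com.bigdata;

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.source.AbstractSource;

public class SkeletonSource extends AbstractSource implements Configurable, PollableSource {

    @Override
    public void configure(Context context) {
        //Read settings from the agent configuration file here
    }

    @Override
    public Status process() throws EventDeliveryException {
        //Build events and hand them to getChannelProcessor();
        //return READY on success, BACKOFF on failure so Flume backs off before retrying
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() {
        //How much the backoff sleep grows after each failed poll (milliseconds)
        return 1000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        //Upper bound on the backoff sleep (milliseconds)
        return 5000;
    }
}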

2.2 Requirements: user-defined Source

Use Flume to receive data, add a prefix to each piece of data, and output it to the console. The prefix can be configured in the Flume configuration file.

2.3 Analysis: user-defined Source

2.4 coding

(1) Import pom dependencies

<dependencies>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>
</dependencies>

(2) Write code

package com.bigdata;
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import java.util.HashMap;
public class MySource extends AbstractSource implements Configurable, PollableSource {
    //Fields that will be read from the configuration file
    private Long delay;
    private String field;

    //Initialize configuration information
    @Override
    public void configure(Context context) {
        delay = context.getLong("delay");
        field = context.getString("field", "Hello!");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            //Create the event header information
            HashMap<String, String> headerMap = new HashMap<>();
            //Create and send events in a loop
            for (int i = 0; i < 5; i++) {
                //Create a new event for each record so that events already handed to the channel are not mutated
                SimpleEvent event = new SimpleEvent();
                //Set header information for the event
                event.setHeaders(headerMap);
                //Set the event body
                event.setBody((field + i).getBytes());
                //Write the event to the channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(delay);
            }
        } catch (Exception e) {
            e.printStackTrace();
            return Status.BACKOFF;
        }
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }
}

2.5 testing

(1) Package
Package the written code into a jar and put it in the lib directory of Flume (/opt/module/flume).
(2) Configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = com.bigdata.MySource
a1.sources.r1.delay = 1000
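# "field" is left commented out below, so the source falls back to its default value "Hello!"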
#a1.sources.r1.field = bigdata
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Start the task

[bigdata@hadoop102 flume]$ pwd
/opt/module/flume
[bigdata@hadoop102 flume]$ bin/flume-ng agent -c conf/ -f job/mysource.conf -n a1 -Dflume.root.logger=INFO,console

(4) Result display

3. Customize the Sink

3.1 introduction

A Sink constantly polls the Channel for events and removes them in batches, writing them in batches to a storage or indexing system, or sending them to another Flume Agent.
Sinks are fully transactional. Before removing a batch of data from the Channel, each Sink starts a transaction with the Channel. Once the batch of events has been successfully written out to the storage system or to the next Flume Agent, the Sink commits the transaction, and the Channel then deletes the events from its own internal buffer.
Sink component destinations include hdfs, logger, avro, thrift, ipc, file, null, HBase, solr, and custom sinks. The officially provided sink types cover many cases, but sometimes they cannot meet the needs of actual development, and we then need to customize sinks according to actual requirements.
Flume also provides an official guide for custom sinks:
https://flume.apache.org/FlumeDeveloperGuide.html#sink
According to the official instructions, a custom MySink needs to extend the AbstractSink class and implement the Configurable interface, implementing the corresponding methods:
configure(Context context) // initialize the context (read the configuration file)
process()                  // read data (events) from the Channel; this method is called in a loop
Usage scenario: reading data from the Channel and writing it to MySQL or other file systems.
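
As a minimal sketch of that contract (an illustrative SkeletonSink class; the full working example follows in 3.3), each call to process() wraps the channel read in a transaction:

package com.bigdata;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class SkeletonSink extends AbstractSink implements Configurable {

    @Override
    public void configure(Context context) {
        //Read settings from the agent configuration file here
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            //take() may return null if the channel is currently empty
            Event event = channel.take();
            if (event != null) {
                //Write the event to the external system here
            }
            txn.commit();
            return Status.READY;
        } catch (Exception e) {
            txn.rollback();
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }
}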

3.2 requirements

Use Flume to receive data, add a prefix and suffix to each piece of data at the Sink, and output it to the console. The prefix and suffix can be configured in the Flume task configuration file.
Process analysis:

3.3 coding

package com.bigdata;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {
    //Create a Logger object
    private static final Logger LOG = LoggerFactory.getLogger(MySink.class);

    private String prefix;
    private String suffix;

    @Override
    public Status process() throws EventDeliveryException {
        //Declare the return status
        Status status;
        //Get the Channel bound to the current Sink
        Channel ch = getChannel();
        //Get a transaction
        Transaction txn = ch.getTransaction();
        //Declare the event
        Event event;
        //Open the transaction
        txn.begin();
        //Poll the Channel until an event is obtained
        while (true) {
            event = ch.take();
            if (event != null) {
                break;
            }
        }
        try {
            //Handle the event (print it with prefix and suffix)
            LOG.info(prefix + new String(event.getBody()) + suffix);
            //Commit the transaction
            txn.commit();
            status = Status.READY;
        } catch (Exception e) {
            //On exception, roll back the transaction
            txn.rollback();
            status = Status.BACKOFF;
        } finally {
            //Close the transaction
            txn.close();
        }
        return status;
    }

    @Override
    public void configure(Context context) {
        //Read from the configuration file, with a default value
        prefix = context.getString("prefix", "hello:");
        //Read from the configuration file, without a default value
        suffix = context.getString("suffix");
    }
}

3.4 testing

(1) Package
Package the written code into a jar and put it in the lib directory of Flume (/opt/module/flume).
(2) Configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = com.bigdata.MySink
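# "prefix" is left commented out below, so the sink falls back to its default value "hello:"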
#a1.sinks.k1.prefix = bigdata:
a1.sinks.k1.suffix = :bigdata
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Start the task

[bigdata@hadoop102 flume]$ bin/flume-ng agent -c conf/ -f job/mysink.conf -n a1 -Dflume.root.logger=INFO,console
[bigdata@hadoop102 ~]$ nc localhost 44444
hello
OK
bigdata
OK

(4) Result display
