Big Data - Flume

Keywords: SQL, MySQL, log4j, Java

Flume Log Collection System

Summary

Flume, provided by Cloudera, is a highly available, reliable, distributed system for collecting, aggregating, and transferring massive amounts of log data.

The core of Flume is to collect data from a data source and then deliver it to a designated destination (sink). To guarantee delivery, the data is buffered before it is sent to the sink; only after the data actually reaches the sink does Flume delete the buffered copy.

Flume supports custom data senders for collecting all types of data, as well as custom data receivers for finally storing the data. Ordinary collection requirements can be met with simple Flume configuration, and for special scenarios Flume also offers good customization and extension capabilities, so it can be applied to most routine data collection scenarios.

Flume currently has two major versions. Flume 0.9.x is called Flume OG (original generation), and Flume 1.x is called Flume NG (next generation). Flume NG differs greatly from Flume OG because its core components, core configuration, and code architecture were refactored. Another reason for the change is that Flume was donated to Apache, and Cloudera Flume was renamed Apache Flume.


Official website: http://flume.apache.org/

Operating Mechanism

The core role in a Flume system is the agent, a Java process that runs on the log collection node.

Each agent acts as a data courier with three components: Source, Channel, and Sink.

Source: the collection component, which connects to the data source to obtain data. Built-in source types include avro, thrift, exec, jms, spooling directory, netcat, syslog, http, and legacy.

Note:

Avro: a subproject of Apache (a data serialization and RPC framework)

Thrift: an RPC framework open-sourced by Facebook

Sink: the delivery component. The purpose of collection is to deliver the data to the next agent or to the final storage system. Built-in sink destinations include hdfs, logger, avro, thrift, ipc, file, null, HBase, solr, and custom sinks.

Channel: A data transmission channel within an agent for transferring data from source to sink.

Throughout the transfer process, what flows is the event, the most basic unit of data transfer in Flume. An event encapsulates the transferred data; for a text file, it is usually one line of a record, and the event is also the basic unit of a transaction. An event travels as a byte array from source to channel to sink and can carry header information. It represents the smallest complete unit of data, from the external data source to the external destination.

A complete event consists of event headers and an event body; the body is the log record collected by Flume.
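
For concreteness, here is a minimal sketch (illustrative only) of building an event with the Flume API, using the same SimpleEvent class that the custom source code later in this article uses; the header key and body text are made-up examples:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.SimpleEvent;

public class EventDemo {
    public static void main(String[] args) {
        // The body is a byte array carrying the payload, e.g. one line of a log file
        Event event = new SimpleEvent();
        event.setBody("2019-08-16 12:00:00 INFO user login".getBytes(StandardCharsets.UTF_8));

        // Optional headers carry metadata as key/value strings
        Map<String, String> headers = new HashMap<>();
        headers.put("hostname", "node01");
        event.setHeaders(headers);
    }
}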

Flume comes with two Channels: Memory Channel and File Channel.

Memory Channel is an in-memory queue. It is suitable when data loss is not a concern. If data must not be lost, Memory Channel should not be used, because a program crash, machine failure, or restart will lose the data.

File Channel writes all events to disk. Therefore, data will not be lost when the program is shut down or the machine is down.

Architecture of Flume Acquisition System

  1. Simple structure
    Data acquisition by a single agent

    Multiple agents connected in series

  2. Complex structure


Flume installation deployment

  1. Upload the installation package to the node where the data source is located and extract it: tar -zxvf apache-flume-1.8.0-bin.tar.gz.
  2. Then enter the flume directory, modify flume-env.sh under conf, and configure JAVA_HOME inside.
  3. Configure the acquisition scheme according to the data acquisition requirement and describe it in the configuration file (the file name can be customized arbitrarily)
  4. Specify the configuration file of the acquisition scheme and start the flume agent on the corresponding node.

A Simple Case of Flume

  • Receive the netcat port data and print it on the console.

    1. First create a new file in the conf directory of flume
      vim netcat-logger.conf

      # Define the names of components in this agent 
      a1.sources = r1 
      a1.sinks = k1 
      a1.channels = c1 
      # Describe and configure source component r1
      a1.sources.r1.type = netcat 
      a1.sources.r1.bind = localhost 
      a1.sources.r1.port = 44444 
      # Describe and configure sink components: k1 
      a1.sinks.k1.type = logger 
      # Describes and configures channel components, which are used here as memory caches 
      a1.channels.c1.type = memory 
      a1.channels.c1.capacity = 1000 
      a1.channels.c1.transactionCapacity = 100
      # Describe and configure the connection between source channel sink 
      a1.sources.r1.channels = c1 
      a1.sinks.k1.channel = c1
      
    2. Start agent to collect data:

      bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
      

      -c conf specifies the directory where Flume's own configuration files are located
      -f conf/netcat-logger.conf specifies the acquisition scheme we described
      -n a1 specifies the name of our agent

    3. test

      The first step is to send data to the port that the agent is listening on, so that the agent has data to collect. On any machine that can reach the agent node:

            nc localhost 44444
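
      If nc is not available, a small Java client can serve the same purpose. This is only a sketch; the host and port are assumed to match the netcat source configured above:

      import java.io.PrintWriter;
      import java.net.Socket;

      public class NetcatTestClient {
          public static void main(String[] args) throws Exception {
              // Connect to the netcat source configured above (localhost:44444)
              try (Socket socket = new Socket("localhost", 44444);
                   PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                  out.println("hello flume");  // each line sent becomes one event
              }
          }
      }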

  • Collect directories to HDFS
    Collection requirements:

    Under a specific directory of the server, new files will be generated continuously. Whenever new files appear, it is necessary to collect the files into HDFS.

    Define three key elements according to needs

    1. The collection source, i.e. the source, monitors a file directory: spooldir.
    2. The delivery target, i.e. the sink: the HDFS file system (hdfs sink).
    3. Channel, the transfer channel between source and sink, can be used either as file channel or as memory channel.

    Configuration File Writing
    vim dirToHDFS.conf

    # Name the components on this agent 
    a1.sources = r1 
    a1.sinks = k1
    a1.channels = c1
    # Describe/configure the source
    ## Note: files with the same name must not be placed into the monitored directory more than once
    a1.sources.r1.type = spooldir 
    a1.sources.r1.spoolDir = /root/logs 
    a1.sources.r1.fileHeader = true
    # Describe the sink 
    a1.sinks.k1.type = hdfs 
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
    a1.sinks.k1.hdfs.filePrefix = events
    a1.sinks.k1.hdfs.round = true 
    a1.sinks.k1.hdfs.roundValue = 10 
    a1.sinks.k1.hdfs.roundUnit = minute
    a1.sinks.k1.hdfs.rollInterval = 3 
    a1.sinks.k1.hdfs.rollSize = 20 
    a1.sinks.k1.hdfs.rollCount = 5
    a1.sinks.k1.hdfs.batchSize = 1 
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Generated file type; the default is SequenceFile, DataStream means plain text
    a1.sinks.k1.hdfs.fileType = DataStream
    
    # Use a channel which buffers events in memory 
    a1.channels.c1.type = memory 
    a1.channels.c1.capacity = 1000 
    a1.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel 
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    

    Channel parameter interpretation:
    capacity: the maximum number of events this channel can hold
    transactionCapacity: the maximum number of events taken from a source or delivered to a sink in a single transaction
    Start flume

    bin/flume-ng agent -c conf -f conf/dirToHDFS.conf  -n a1 -Dflume.root.logger=INFO,console
    
  • Collect files to HDFS
    Collection requirements:

    For example, business systems use log4j to generate logs, and the content of logs is increasing. It is necessary to collect the data appended to the log files into hdfs in real time.

    Define three key elements according to needs

    1. Collection source, monitoring updates to a file's content: exec 'tail -F file'
    2. Sinking target, sink-HDFS file system: hdfs sink
    3. Channel, the transfer channel between source and sink, can be used either as file channel or as memory channel.

    Configuration File Writing

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1 
    a1.channels = c1
    # Describe/configure the source
    a1.sources.r1.type = exec 
    a1.sources.r1.command = tail -F /root/logs/test.log
    a1.sources.r1.channels = c1
    # Describe the sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
    a1.sinks.k1.hdfs.filePrefix = events
    a1.sinks.k1.hdfs.round = true 
    a1.sinks.k1.hdfs.roundValue = 10 
    a1.sinks.k1.hdfs.roundUnit = minute 
    a1.sinks.k1.hdfs.rollInterval = 3 
    a1.sinks.k1.hdfs.rollSize = 20 
    a1.sinks.k1.hdfs.rollCount = 5
    a1.sinks.k1.hdfs.batchSize = 1 
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    
    # Generated file type; the default is SequenceFile, DataStream means plain text
    a1.sinks.k1.hdfs.fileType = DataStream
    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory 
    a1.channels.c1.capacity = 1000 
    a1.channels.c1.transactionCapacity = 100
    
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
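
    To test, the monitored file has to keep growing. Below is a minimal sketch that simulates this by appending lines to the tailed file (the path is assumed to match the config above):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class LogAppender {
        public static void main(String[] args) throws Exception {
            Path log = Paths.get("/root/logs/test.log");
            for (int i = 0; i < 10; i++) {
                // Append one line at a time; the exec source tails the file and forwards each new line
                String line = "log line " + i + System.lineSeparator();
                Files.write(log, line.getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                Thread.sleep(1000);
            }
        }
    }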
    

Flume custom MySQL Source

Custom Source Description

Source is the component responsible for receiving data into the Flume agent. Source components can handle various types and formats of log data, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy. Although many source types are provided officially, sometimes they cannot meet actual development needs, and in that case we need to customize a Source according to the actual requirements.

For example, to monitor MySQL in real time and transfer data from MySQL to HDFS or another storage framework, we need to implement a MySQL Source ourselves.

Officials also provide a custom source interface:

      Description: https://flume.apache.org/FlumeDeveloperGuide.html#source

Custom MySQL Source Composition

Customize MySQL Source steps

According to the official instructions, a custom MySqlSource must extend the AbstractSource class and implement the Configurable and PollableSource interfaces, providing the corresponding methods:

getBackOffSleepIncrement()    // not used for now
getMaxBackOffSleepInterval()  // not used for now
configure(Context context)    // initialize the context

process()  // Get the data (getting data from MySQL is a relatively complex business process, so we define a dedicated class, SQLSourceHelper, to handle the interaction with MySQL), encapsulate it as Events and write them to the Channel; this method is called in a loop.

stop()  // close related resources
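
A minimal skeleton of the class described above, shown here only to make the required overrides concrete (the full implementation is given in the MySQLSource section below):

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.source.AbstractSource;

public class SQLSource extends AbstractSource implements Configurable, PollableSource {

    @Override
    public void configure(Context context) {
        // read connection.url, table, ... from the Flume configuration context
    }

    @Override
    public Status process() throws EventDeliveryException {
        // query MySQL, wrap the rows as events, write them to the channel, update the offset
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() { return 0; }   // not used for now

    @Override
    public long getMaxBackOffSleepInterval() { return 0; } // not used for now

    @Override
    public synchronized void stop() {
        // close related resources
        super.stop();
    }
}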

code implementation

Importing POM dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
</dependencies>

Add configuration information

Add jdbc.properties and log4j.properties to the classpath.
jdbc.properties:

dbDriver=com.mysql.jdbc.Driver
dbUrl=jdbc:mysql://hadoop102:3306/mysqlsource?useUnicode=true&characterEncoding=utf-8
dbUser=root
dbPassword=000000
log4j.properties:
#--------console-----------
log4j.rootLogger=info,myconsole,myfile
log4j.appender.myconsole=org.apache.log4j.ConsoleAppender
log4j.appender.myconsole.layout=org.apache.log4j.SimpleLayout
#log4j.appender.myconsole.layout.ConversionPattern =%d [%t] %-5p 	[%c] - %m%n

#log4j.rootLogger=error,myfile
log4j.appender.myfile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.myfile.File=/tmp/flume.log
log4j.appender.myfile.layout=org.apache.log4j.PatternLayout
log4j.appender.myfile.layout.ConversionPattern =%d [%t] %-5p [%c] - %m%n

SQLSourceHelper

  1. Attribute description:

    | Attribute | Description (default in parentheses) |
    | runQueryDelay | query interval (10000) |
    | batchSize | cache size (100) |
    | startFrom | starting id of the query (0) |
    | currentIndex | current id of the query; the metadata table is checked before each query |
    | recordSixe | number of rows returned by the query |
    | table | name of the monitored table |
    | columnsToSelect | columns to query (*) |
    | customQuery | query statement passed in by the user |
    | query | the query statement that is executed |
    | defaultCharsetResultSet | encoding format (UTF-8) |

  2. Method Description:

    | Method | Description |
    | SQLSourceHelper(Context context) | constructor; initializes attributes and obtains the JDBC connection |
    | InitConnection(String url, String user, String pw) | obtains the JDBC connection |
    | checkMandatoryProperties() | verifies that the required properties are set (more checks can be added in real development) |
    | buildQuery() | builds the SQL statement for the current offset; returns a String |
    | executeQuery() | executes the SQL query; returns List<List<Object>> |
    | getAllRows(List<List<Object>> queryResult) | converts the query results to Strings for later processing |
    | updateOffset2DB(int size) | writes the offset to the metadata table after each query |
    | execSql(String sql) | actually executes a SQL statement |
    | getStatusDBIndex(int startFrom) | gets the offset from the metadata table |
    | queryOne(String sql) | executes the SQL that reads the offset from the metadata table |
    | close() | closes resources |
  3. code analysis

  4. code implementation

    import org.apache.flume.Context;
    import org.apache.flume.conf.ConfigurationException;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    import java.io.IOException;
    import java.sql.*;
    import java.text.ParseException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    
    public class SQLSourceHelper {
    
        private static final Logger LOG = LoggerFactory.getLogger(SQLSourceHelper.class);
    
        private int runQueryDelay, //Time interval between two queries
                startFrom,            //Start id
                currentIndex,	        //Current id
                recordSixe = 0,      //Number of returned results per query
                maxRow;                //Maximum number of bars per query
    
        private String table,       //Tables to be operated on
                columnsToSelect,     //Columns of user-incoming queries
                customQuery,          //Input Query Statement by User
                query,                 //Constructed Query Statement
                defaultCharsetResultSet;//Coding set
    
        //Context, used to retrieve configuration files
        private Context context;
    
        //Assignment of a defined variable (default value) can be modified in the configuration file of the flume task
        private static final int DEFAULT_QUERY_DELAY = 10000;
        private static final int DEFAULT_START_VALUE = 0;
        private static final int DEFAULT_MAX_ROWS = 2000;
        private static final String DEFAULT_COLUMNS_SELECT = "*";
        private static final String DEFAULT_CHARSET_RESULTSET = "UTF-8";
    
        private static Connection conn = null;
        private static PreparedStatement ps = null;
        private static String connectionURL, connectionUserName, connectionPassword;
    
        //Loading static resources
        static {
    
            Properties p = new Properties();
    
            try {
                p.load(SQLSourceHelper.class.getClassLoader().getResourceAsStream("jdbc.properties"));
                connectionURL = p.getProperty("dbUrl");
                connectionUserName = p.getProperty("dbUser");
                connectionPassword = p.getProperty("dbPassword");
                Class.forName(p.getProperty("dbDriver"));
    
            } catch (IOException | ClassNotFoundException e) {
                LOG.error(e.toString());
            }
        }
    
        //Get the JDBC connection
        private static Connection InitConnection(String url, String user, String pw) {
            try {
    
                Connection conn = DriverManager.getConnection(url, user, pw);
    
                if (conn == null)
                    throw new SQLException();
    
                return conn;
    
            } catch (SQLException e) {
                e.printStackTrace();
            }
    
            return null;
        }
    
        //Construction method
        SQLSourceHelper(Context context) throws ParseException {
    
            //Initialization context
            this.context = context;
    
            //Default parameters: Get the parameters in the flume task configuration file and use the default values if you can't read them.
            this.columnsToSelect = context.getString("columns.to.select", DEFAULT_COLUMNS_SELECT);
    
            this.runQueryDelay = context.getInteger("run.query.delay", DEFAULT_QUERY_DELAY);
    
            this.startFrom = context.getInteger("start.from", DEFAULT_START_VALUE);
    
            this.defaultCharsetResultSet = context.getString("default.charset.resultset", DEFAULT_CHARSET_RESULTSET);
    
            //No default parameters: Get the parameters in the flume task configuration file
            this.table = context.getString("table");
            this.customQuery = context.getString("custom.query");
    
            connectionURL = context.getString("connection.url");
    
            connectionUserName = context.getString("connection.user");
    
            connectionPassword = context.getString("connection.password");
    
            conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
    
            //Check the configuration information and throw an exception if there is no default parameter and no assignment
            checkMandatoryProperties();
    
            //Get the current id
            currentIndex = getStatusDBIndex(startFrom);
    
            //Building Query Statements
            query = buildQuery();
        }
    
        //Check the configuration information (tables, query statements, and database connection parameters)
        private void checkMandatoryProperties() {
    
            if (table == null) {
                throw new ConfigurationException("property table not set");
            }
    
            if (connectionURL == null) {
                throw new ConfigurationException("connection.url property not set");
            }
    
            if (connectionUserName == null) {
                throw new ConfigurationException("connection.user property not set");
            }
    
            if (connectionPassword == null) {
                throw new ConfigurationException("connection.password property not set");
            }
        }
    
        //Constructing sql statements
        private String buildQuery() {
    
            String sql = "";
    
            //Get the current id
            currentIndex = getStatusDBIndex(startFrom);
            LOG.info(currentIndex + "");
    
            if (customQuery == null) {
                sql = "SELECT " + columnsToSelect + " FROM " + table;
            } else {
                sql = customQuery;
            }
    
            StringBuilder execSql = new StringBuilder(sql);
    
            //Using id as offset
            if (!sql.contains("where")) {
                execSql.append(" where ");
                execSql.append("id").append(">").append(currentIndex);
    
                return execSql.toString();
            } else {
                int length = execSql.toString().length();
    
                return execSql.toString().substring(0, length - String.valueOf(currentIndex).length()) + currentIndex;
            }
        }
    
        //Execute queries
        List<List<Object>> executeQuery() {
    
            try {
                //Every time a query is executed, sql is regenerated because the id is different
                customQuery = buildQuery();
    
                //A collection of results
                List<List<Object>> results = new ArrayList<>();
    
                //Prepare a statement for the freshly built sql (the id changes between queries)
                ps = conn.prepareStatement(customQuery);

                ResultSet result = ps.executeQuery();
    
                while (result.next()) {
    
                    //A collection of data (multiple columns)
                    List<Object> row = new ArrayList<>();
    
                    //Put the returned results into the collection
                    for (int i = 1; i <= result.getMetaData().getColumnCount(); i++) {
                        row.add(result.getObject(i));
                    }
    
                    results.add(row);
                }
    
                LOG.info("execSql:" + customQuery + "\nresultSize:" + results.size());
    
                return results;
            } catch (SQLException e) {
                LOG.error(e.toString());
    
                // Reconnection
                conn = InitConnection(connectionURL, connectionUserName, connectionPassword);
    
            }
    
            return null;
        }
    
        //Convert the result set to a string, each data is a list set, and each small list set is converted to a string.
        List<String> getAllRows(List<List<Object>> queryResult) {
    
            List<String> allRows = new ArrayList<>();
    
            if (queryResult == null || queryResult.isEmpty())
                return allRows;
    
            StringBuilder row = new StringBuilder();
    
            for (List<Object> rawRow : queryResult) {
    
                Object value = null;
    
                for (Object aRawRow : rawRow) {
    
                    value = aRawRow;
    
                    if (value == null) {
                        row.append(",");
                    } else {
                        row.append(aRawRow.toString()).append(",");
                    }
                }
    
                allRows.add(row.toString());
                row = new StringBuilder();
            }
    
            return allRows;
        }
    
        //Update the offset metadata status; called every time a result set is returned. The offset of each query must be recorded so that a restarted program can resume where it left off, using id as the offset
        void updateOffset2DB(int size) {
            //Use source_tab as KEY, insert if it does not exist, and update if it does (each source table corresponds to a record)
            String sql = "insert into flume_meta(source_tab,currentIndex) VALUES('"
                    + this.table
                    + "','" + (recordSixe += size)
                    + "') on DUPLICATE key update source_tab=values(source_tab),currentIndex=values(currentIndex)";
    
            LOG.info("updateStatus Sql:" + sql);
    
            execSql(sql);
        }
    
        //Execute sql statements
        private void execSql(String sql) {
    
            try {
                ps = conn.prepareStatement(sql);
    
                LOG.info("exec::" + sql);
    
                ps.execute();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    
        //offset to get the current id
        private Integer getStatusDBIndex(int startFrom) {
    
            //Query the current id from the flume_meta table
            String dbIndex = queryOne("select currentIndex from flume_meta where source_tab='" + table + "'");
    
            if (dbIndex != null) {
                return Integer.parseInt(dbIndex);
            }
    
            //If there is no data, it means that it is the first query or that the data has not been stored in the data table, returning the original value.
            return startFrom;
        }
    
        //Execution statement to query a data (current id)
        private String queryOne(String sql) {
    
            ResultSet result = null;
    
            try {
                ps = conn.prepareStatement(sql);
                result = ps.executeQuery();
    
                while (result.next()) {
                    return result.getString(1);
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
    
            return null;
        }
    
        //Closing related resources
        void close() {
    
            try {
                ps.close();
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    
        int getCurrentIndex() {
            return currentIndex;
        }
    
        void setCurrentIndex(int newValue) {
            currentIndex = newValue;
        }
    
        int getRunQueryDelay() {
            return runQueryDelay;
        }
    
        String getQuery() {
            return query;
        }
    
        String getConnectionURL() {
            return connectionURL;
        }
    
        private boolean isCustomQuerySet() {
            return (customQuery != null);
        }
    
        Context getContext() {
            return context;
        }
    
        public String getConnectionUserName() {
            return connectionUserName;
        }
    
        public String getConnectionPassword() {
            return connectionPassword;
        }
    
        String getDefaultCharsetResultSet() {
            return defaultCharsetResultSet;
        }
    }
    
    

MySQLSource

Code implementation:

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.text.ParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class SQLSource extends AbstractSource implements Configurable, PollableSource {

    //Print logs
    private static final Logger LOG = LoggerFactory.getLogger(SQLSource.class);

    //Define sqlHelper
    private SQLSourceHelper sqlSourceHelper;


    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }

    @Override
    public void configure(Context context) {

        try {
            //Initialization
            sqlSourceHelper = new SQLSourceHelper(context);
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    @Override
    public Status process() throws EventDeliveryException {

        try {
            //Query data table
            List<List<Object>> result = sqlSourceHelper.executeQuery();

            //Collection of event s
            List<Event> events = new ArrayList<>();

            //Store event header collection
            HashMap<String, String> header = new HashMap<>();

            //If there is return data, encapsulate the data as event
            if (!result.isEmpty()) {

                List<String> allRows = sqlSourceHelper.getAllRows(result);

                Event event = null;

                for (String row : allRows) {
                    event = new SimpleEvent();
                    event.setBody(row.getBytes());
                    event.setHeaders(header);
                    events.add(event);
                }

                //Write event to channel
                this.getChannelProcessor().processEventBatch(events);

                //Update offset information in data tables
                sqlSourceHelper.updateOffset2DB(result.size());
            }

            //Waiting time
            Thread.sleep(sqlSourceHelper.getRunQueryDelay());

            return Status.READY;
        } catch (InterruptedException e) {
            LOG.error("Error procesing row", e);

            return Status.BACKOFF;
        }
    }

    @Override
    public synchronized void stop() {

        LOG.info("Stopping sql source {} ...", getName());

        try {
            //close resource
            sqlSourceHelper.close();
        } finally {
            super.stop();
        }
    }
}

test

Jar Package Placement

  1. Put MySql driver packages in Flume's lib directory
    [atguigu@hadoop102 flume]$ cp \
    /opt/sorfware/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar \
    /opt/module/flume/lib/
    
  2. Package the project and place the jar in Flume's lib directory.

Configuration File Preparation

  1. Create a configuration file and open it
    [atguigu@hadoop102 job]$ touch mysql.conf
    [atguigu@hadoop102 job]$ vim mysql.conf 
    
  2. Add the following
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    # Describe/configure the source
    a1.sources.r1.type = com.bw.flume.SQLSource  
    a1.sources.r1.connection.url = jdbc:mysql://192.168.9.102:3306/mysqlsource
    a1.sources.r1.connection.user = root  
    a1.sources.r1.connection.password = 000000  
    a1.sources.r1.table = student  
    a1.sources.r1.columns.to.select = *  
    #a1.sources.r1.incremental.column.name = id  
    #a1.sources.r1.incremental.value = 0 
    a1.sources.r1.run.query.delay=5000
    
    # Describe the sink
    a1.sinks.k1.type = logger
    
    # Describe the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    

MySQL Table Preparation

  1. Create the mysqlsource database

    CREATE DATABASE mysqlsource;
    
  2. Create the data table student and the metadata table flume_meta in the mysqlsource database

    CREATE TABLE `student` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `name` varchar(255) NOT NULL,
    PRIMARY KEY (`id`)
    );
    CREATE TABLE `flume_meta` (
    `source_tab` varchar(255) NOT NULL,
    `currentIndex` varchar(255) NOT NULL,
    PRIMARY KEY (`source_tab`)
    );
    
  3. Add data to the student table

    1 zhangsan
    2 lisi
    3 wangwu
    4 zhaoliu
    

View the results

  1. Task execution

    [atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 \
    --conf-file job/mysql.conf -Dflume.root.logger=INFO,console
    
  2. Result presentation

Knowledge Expansion (Understanding)

Common regular expression grammar

Metacharacter | Description
^ | Matches the start of the input string. If the RegExp object's Multiline property is set, ^ also matches the position after "\n" or "\r".
$ | Matches the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position before "\n" or "\r".
* | Matches the preceding subexpression zero or more times. For example, zo* matches "z", "zo" and "zoo". * is equivalent to {0,}.
+ | Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
[a-z] | Character range; matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z". Note: a hyphen denotes a range only when it appears inside a character group between two other characters; at the beginning of the character group it matches only a literal hyphen.
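
A quick Java illustration of these metacharacters (a throwaway sketch using java.util.regex):

import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("^zo*$", "zoo"));    // true: ^ start, $ end, * zero or more 'o'
        System.out.println(Pattern.matches("zo+", "z"));        // false: + requires at least one 'o'
        System.out.println(Pattern.matches("[a-z]+", "flume")); // true: one or more lowercase letters
    }
}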

Enterprise Real Interview Questions

How do you monitor Flume data transmission?

Use the third-party framework Ganglia to monitor Flume in real time.

What are the roles of Flume's Source, Sink and Channel? Which Source types have you used?

  1. Roles

    Source components are designed to collect data and can process various types and formats of log data, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy

    The Channel component caches the collected data; it can be stored in Memory or File.

    The Sink component sends the data from the channel to its destination; destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

  2. The Source types used in our company:

    Monitoring background log files: exec

    Monitoring the port where background logs are generated: netcat

Channel Selectors of Flume

Flume parameter tuning

  1. Source

    Increasing the number of Sources (or, when using the Taildir Source, increasing the number of filegroups) can increase the Source's ability to read data. For example, when a directory produces too many files, it is necessary to split it into multiple directories and configure multiple Sources, so that the Sources have enough capacity to pick up the newly generated data.

    The batchSize parameter determines the number of events that Source transports to Channel in a batch. Properly increasing this parameter can improve the performance of Source when it transports Event to Channel.

  2. Channel

    Channel performs best when type chooses memory, but may lose data if the Flume process crashes unexpectedly. Channel's fault tolerance is better when type chooses file, but its performance is worse than memory channel.

    Data Dirs can improve performance by configuring multiple directories on different disks when using file Channel.

    The capacity parameter determines the maximum number of events the Channel can hold. The transactionCapacity parameter determines the maximum number of events the Source can write to the channel, and the Sink can read from the channel, in a single transaction. transactionCapacity must be larger than the batchSize of both the Source and the Sink.

  3. Sink

    Increasing the number of Sinks can increase the Sink's ability to consume events. However, more Sinks are not always better; too many Sinks occupy system resources and cause unnecessary waste.

    The batchSize parameter determines the number of events Sink reads from Channel in batches at a time. Properly adjusting this parameter can improve the performance of Sink moving event from Channel.

Transaction mechanism of Flume

Flume's transaction mechanism (similar to a database transaction mechanism): Flume uses two separate transactions, one for the transfer of events from Source to Channel and one from Channel to Sink. For example, the spooling directory source creates an event for each line of a file; only once all events in a transaction have been passed to the Channel and the transaction has been committed successfully does the Source mark the file as complete. Transactions handle the transfer from Channel to Sink in the same way: if for some reason an event cannot be delivered, the transaction is rolled back, and all events remain in the Channel waiting to be delivered again.
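
The same pattern is visible in the public Channel/Transaction API. The sketch below is not Flume's internal implementation, just an illustration of how a put transaction brackets writes to a channel:

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

public class PutTransactionSketch {

    // Puts one event into the channel inside a transaction; on failure the put is rolled back
    public static void putWithTransaction(Channel channel, byte[] body) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = EventBuilder.withBody(body);
            channel.put(event);
            tx.commit();       // only now does the event become visible to the sink side
        } catch (RuntimeException e) {
            tx.rollback();     // nothing half-written is left in the channel
            throw e;
        } finally {
            tx.close();
        }
    }
}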

Why can't Flume collect data without losing it?

Channel data can be stored in a File Channel, and the data transfer itself is wrapped in transactions.
