Centralized Log Analysis Platform - ELK Stack - Filebeat Stress Test

Keywords: Kafka, Python, Linux, JSON

This article is reproduced here for personal archiving only. See the original article:

https://www.jianshu.com/p/0802cf22a135


Preface

Any collection agent needs performance testing and resource-limit testing before it can be rolled out across the company, so that we:

  • know clearly where the performance bottleneck is;
  • know clearly how it behaves when the data volume soars;
  • know clearly whether the resource-limiting features actually take effect when the data volume soars.

Filebeat, a collection agent written in Go, is reputed to perform better than logstash-forwarder, but we still need to verify that rigorously.

Test details

Hardware environment and version

The tests run on a virtual machine: 6 cores + 16 GB RAM + 175 GB SSD + 1000 Mbps bandwidth.

  • Filebeat version 5.2.1, Kafka version 2.11-0.10.1.1;
  • 172.16.134.3: kafka broker
  • 172.16.134.8: Filebeat

Test plan

  • Filebeat outputs to the console; 6 processes write to the collected file; 247 B per line.
  • Filebeat outputs to Kafka; 6 processes write to the collected file; 247 B per line; gzip compression.
  • Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 247 B per line; gzip compression.
  • Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 1 KB per line; gzip compression.
  • Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 5 KB per line; gzip compression. (A sketch for generating lines of these sizes follows this list.)
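
For reference, fixed-size test lines like these can be generated with a short sketch (the sizes are the ones used in the tests; the helper itself is ours, not from the original article):

# make_payload.py - build a single log line padded to an exact byte size
import string

def make_line(size_bytes):
    # pad a timestamped prefix with letters until the line
    # (including the trailing newline) is exactly size_bytes long
    prefix = "2017-03-16T19:21:14+08:00 INFO "
    pad = size_bytes - len(prefix) - 1
    assert pad >= 0, "size too small for the timestamp prefix"
    body = (string.ascii_letters * (pad // 52 + 1))[:pad]
    return prefix + body + "\n"

if __name__ == "__main__":
    for size in (247, 1024, 5 * 1024):   # the 247 B / 1 KB / 5 KB cases
        line = make_line(size)
        assert len(line.encode("ascii")) == size
        print("{} -> {} bytes".format(size, len(line)))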

Test configuration

Filebeat configuration, output to console:

filebeat.prospectors:
- input_type: log
  paths:
    - /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test
  document_type: Filebeat
output.console:
  pretty: true

Filebeat configuration, output to Kafka (the topic '%{[type]}' resolves to each event's document_type, so the topic name here becomes Filebeat):

filebeat.prospectors:
- input_type: log
  paths:
    - /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test
  document_type: Filebeat
output.kafka:
  hosts: ["172.16.134.3:9092"]
  topic: '%{[type]}'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000

Statistical method

We enable Filebeat's profiling endpoint on port 6060 (-httpprof) and use a Python script, expvar_rates.py, to sample the expvar metrics once per second. We mainly watch:

  • libbeat.publisher.published_events, avg
  • publish.events, avg
  • libbeat.kafka.published_and_acked_events, avg

For system metrics we mainly watch CPU usage, for which top or tsar is sufficient. The script is as follows:
import requests
import argparse
import time
import curses
import re


def main():
    parser = argparse.ArgumentParser(
        description="Print per second stats from expvars")
    parser.add_argument("--url", default="http://localhost:6060",
                        help="The URL from where to read the values")
    parser.add_argument("--filter", default=None,
                        help="Filter metrics by this search regexp")
    args = parser.parse_args()

    stdscr = curses.initscr()
    curses.noecho()
    curses.cbreak()

    last_vals = {}

    # keep a running average over the last N one-second samples
    N = 30
    avg_vals = {}
    now = time.time()

    try:
        while True:
            try:
                time.sleep(1.0)
                stdscr.erase()

                r = requests.get(args.url + "/debug/vars")
                vals = r.json()

                last = now
                now = time.time()
                dt = now - last

                for key, total in vals.items():
                    if args.filter is not None:
                        if re.search(args.filter, key) is None:
                            continue

                    # only numeric counters can be turned into per-second rates
                    if isinstance(total, (int, float)):
                        if key in last_vals:
                            per_sec = (total - last_vals[key]) / dt
                            avg_vals.setdefault(key, []).append(per_sec)
                            if len(avg_vals[key]) > N:
                                avg_vals[key] = avg_vals[key][1:]
                            avg_sec = sum(avg_vals[key]) / len(avg_vals[key])
                        else:
                            per_sec = "na"
                            avg_sec = "na"
                        last_vals[key] = total
                        try:
                            stdscr.addstr("{}: {}/s (avg: {}/s) (total: {})\n"
                                          .format(key, per_sec, avg_sec, total))
                        except Exception as e:
                            raise Exception("curses.addstr failed. Resize the "
                                            "terminal window or use the --filter "
                                            "option: {}".format(e))
                stdscr.refresh()
            except requests.ConnectionError:
                stdscr.addstr("Waiting for connection...\n")
                stdscr.refresh()
                last_vals = {}
                avg_vals = {}
    finally:
        # restore the terminal state even if the loop is interrupted
        curses.endwin()


if __name__ == "__main__":
    main()
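
A typical invocation of the script (the filter pattern here is only an illustration, not from the original article):

python expvar_rates.py --url http://localhost:6060 --filter "libbeat|publish"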

Test procedure

We generate the test data with a simple shell loop:

time for i in {1..10000000};do echo "${log_message}" >> /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test;done;

The execution time shows a log write rate of roughly 250 K lines/s. From ls -l /proc/${pid}/fd | grep -v total | wc -l we can see that Filebeat holds 9 handles when collecting a single file and 10 when collecting two files, so the base handle count is 9 and each additional collection point adds one handle.
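
The handle count can also be read programmatically. A minimal sketch, assuming it runs on the Filebeat host with permission to read /proc/<pid>/fd (the helper below is ours, not from the original article):

# count_fds.py - count open file descriptors of a process via /proc
import os
import sys

def count_fds(pid):
    # each entry under /proc/<pid>/fd is one open descriptor
    return len(os.listdir("/proc/{}/fd".format(pid)))

if __name__ == "__main__":
    pid = int(sys.argv[1])                  # e.g. the Filebeat pid
    # expect 9 for a single collected file, +1 per additional collection point
    print("open fds:", count_fds(pid))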

Case A: Filebeat outputs to the console; 6 processes write to the collected file; 247 B per line

Step 1. Start Filebeat (172.16.134.8)

/home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat -c /home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat.yml -httpprof 0.0.0.0:6060 > /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/filebeat 2>&1 &

Step 2. Start the statistics script

python expvar_rates.py

Step 3. Start tsar

tsar --cpu --live -i 1

Step 4. Write the stress-test data (6 writer processes, 60 million logs in total)

for i in {1..10000000};do echo "2017-03-16T19:21:14+08:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=2 registrar.states.update=3 libbeat.kafka.call_count.PublishEvents=2 publish.events=3 libbeat.kafka.published_and_acked_events=2 registrar.writes=2" >> /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test;done; 
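
The original test runs the shell loop above in 6 terminals at once. For reference, a minimal sketch producing the same load from Python (the path matches the test; the line content is a shortened version of the one above; the script itself is ours, not from the original article):

# write_load.py - 6 writer processes appending fixed log lines
from multiprocessing import Process

LOG = "/home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test"
LINE = ("2017-03-16T19:21:14+08:00 INFO Non-zero metrics in the last 30s: "
        "libbeat.publisher.published_events=2 registrar.states.update=3\n")

def writer(n_lines):
    # line buffering makes each write a single append, so O_APPEND keeps
    # lines from the 6 processes from interleaving mid-line
    with open(LOG, "a", buffering=1) as f:
        for _ in range(n_lines):
            f.write(LINE)

if __name__ == "__main__":
    procs = [Process(target=writer, args=(10_000_000,)) for _ in range(6)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()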

While the 6 writer processes are still appending to the log file, the Python statistics window shows the following steady figures:

libbeat.publisher.published_events: 18014.370905/s (avg: 19641.6123106/s) 
publish.events: 21920.7951289/s (avg: 19610.1652594/s)

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
67.3    22.8     0.0     0.0     0.0    90.2 

We can see in top that Filebeat accounts for roughly 0.8 cores.

After the 6 writer processes have finished writing, the Python statistics window shows the following steady figures:

libbeat.publisher.published_events: 40691.6042823/s (avg: 39300.4436095/s) 
publish.events: 38676.4998717/s (avg: 39370.3604289/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
24.6     2.0     0.2     0.0     0.0    26.9 

We can see in top that Filebeat accounts for roughly 1.6 cores.
Summary:

  • While the 6 processes are writing their log files, the writers themselves eat up almost all the CPU, and Filebeat's collection rate cannot keep up with the data generation rate: about 19 K/s at a CPU cost of 0.8 cores.
  • After the 60 million logs have been written, Filebeat pushes the collected data to the console at full speed, at roughly 39 K/s with a CPU cost of 1.6 cores.

Case B: Filebeat outputs to Kafka; 6 processes write to the collected file; 247 B per line; gzip compression

The test steps are the same as above; the only difference is that the output now goes to Kafka.

While the 6 writer processes are still appending to the log file, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 14125.0820764/s (avg: 16282.8981012/s) 
libbeat.publisher.published_events: 17357.7280373/s (avg: 16419.7944927/s)
publish.events: 16251.4829085/s (avg: 16510.4781086/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
67.9    23.1     0.0     0.0     0.0    91.1  

We can see in top that Filebeat occupies roughly 0.7-0.8 cores.

After the 6 writer processes have finished writing, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 40614.7704231/s (avg: 39698.759785/s) 
libbeat.publisher.published_events: 40722.3439941/s (avg: 39628.4561631/s) 
publish.events: 38671.5053801/s (avg: 38930.2225183/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
29.2     0.9     0.0     0.0     0.0    30.5

We can see in top that Filebeat accounts for roughly 2.0 cores.
Summary:

  • While the 6 processes are writing their log files, the writers themselves eat up almost all the CPU, and Filebeat's collection rate cannot keep up with the data generation rate: about 16-17 K/s at a CPU cost of 0.7-0.8 cores.
  • After the 60 million logs have been written, Filebeat pushes the collected data to Kafka at full speed, at 39-40 K/s with a CPU cost of 2.0 cores.

Case C: Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 247 B per line; gzip compression

The test steps are the same as above, except that systemd is used to limit Filebeat's CPU and handle count at startup. Based on the handle analysis above, a limit of 100 handles is more than sufficient, and the CPU is capped at 1 core.
Modify /usr/lib/systemd/system/filebeat.service:

[Unit]
Description=Filebeat
Documentation=https://www.elastic.co/guide
After=network.target

[Service]
Type=simple
ExecStart=/home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat -c /home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat.yml -httpprof 0.0.0.0:6060
ExecStop=/bin/kill -WINCH ${MAINPID}
Restart=always
RestartSec=0
WatchdogSec=1min
LimitNOFILE=100
LimitNPROC=100

[Install]
WantedBy=multi-user.target

Reload the systemd units:

sudo systemctl daemon-reload

Restrict CPU:

sudo systemctl set-property filebeat CPUQuota=100% 

Confirm that the limits took effect:

sudo systemctl show filebeat | grep -E "CPUQuota|LimitNOFILE|LimitNPROC"

The following output indicates success:

CPUQuotaPerSecUSec=1s
LimitNOFILE=100
LimitNPROC=100
DropInPaths=/etc/systemd/system/filebeat.service.d/50-CPUQuota.conf
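
The quota can also be checked at the cgroup level. A minimal sketch, assuming cgroup v1 under systemd; the exact path is an assumption and varies by distribution (on cgroup v2 the file is cpu.max instead):

# check_cpu_quota.py - read the CFS quota that CPUQuota=100% installs
# NOTE: the cgroup v1 path below is an assumption, not from the original article
BASE = "/sys/fs/cgroup/cpu/system.slice/filebeat.service"

def read_int(name):
    with open("{}/{}".format(BASE, name)) as f:
        return int(f.read())

if __name__ == "__main__":
    quota = read_int("cpu.cfs_quota_us")    # expect 100000 for CPUQuota=100%
    period = read_int("cpu.cfs_period_us")  # default period is 100000 (100 ms)
    print("CPU quota: {:.1f} core(s)".format(quota / period))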

While the 6 writer processes are still appending to the log file, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 19844.7720225/s (avg: 16837.4773012/s) 
libbeat.publisher.published_events: 18030.6556898/s (avg: 16764.6918644/s) 
publish.events: 15644.7977205/s (avg: 16910.2166832/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
79.2    11.3     0.0     0.0     0.0    90.6

We can see in top that Filebeat accounts for roughly 0.7 to 0.8 cores.

After the 6 writer processes have finished writing, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 18337.7079915/s (avg: 18923.3234652/s) 
libbeat.publisher.published_events: 20376.3669159/s (avg: 18848.5970484/s) 
publish.events: 20323.8510655/s (avg: 18914.4681101/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
14.4     0.5     0.0     0.0     0.0    15.2 

We can see in top that Filebeat occupies roughly 1.0 core, so the restriction is in effect.
Summary:

  • While the 6 processes are writing their log files, the writers themselves eat up almost all the CPU, and Filebeat's collection rate cannot keep up with the data generation rate: about 16-17 K/s at a CPU cost of 0.7-0.8 cores.
  • After the 60 million logs have been written, Filebeat pushes the collected data to Kafka at full speed, at 18-19 K/s with a CPU cost of 1.0 core.

Case D: Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 1 KB per line; gzip compression

While the 6 writer processes are still appending to the log file, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 11954.3900946/s (avg: 13256.2128845/s) 
libbeat.publisher.published_events: 14084.2447431/s (avg: 13241.4999292/s)
publish.events: 11998.8163015/s (avg: 13205.6630096/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
85.4     5.0     0.1     0.0     0.0    90.5

We can see in top that Filebeat accounts for roughly 0.75 to 0.9 cores.

After the 6 writer processes have finished writing, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 14257.5686603/s (avg: 15411.161373/s)
libbeat.publisher.published_events: 16297.7185003/s (avg: 15475.6613026/s)
publish.events: 16292.4519132/s (avg: 15545.8192101/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
14.5     0.6     0.0     0.0     0.0    15.3

We can see in top that Filebeat occupies roughly 1.0 core, so the restriction is in effect.
Summary:

  • While the 6 processes are writing their log files, the writers themselves eat up almost all the CPU, and Filebeat's collection rate cannot keep up with the data generation rate: about 13-14 K/s at a CPU cost of 0.75-0.9 cores.
  • After the 60 million logs have been written, Filebeat pushes the collected data to Kafka at full speed, at roughly 15-16 K/s with a CPU cost of 1.0 core.

Case E: Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 5 KB per line; gzip compression

While the 6 writer processes are still appending to the log file, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 5853.48499116/s (avg: 6142.33474568/s) 
libbeat.publisher.published_events: 6052.63549446/s (avg: 6066.04758752/s) 
publish.events: 5979.76512269/s (avg: 6063.20532399/s) 

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
89.2     2.2     0.1     0.0     0.0    91.5 

We can see in top that Filebeat accounts for roughly 0.7 to 0.75 cores.

After the 6 writer processes have finished writing, the Python statistics window shows the following steady figures:

libbeat.kafka.published_and_acked_events: 6110.43003481/s (avg: 7030.51292764/s)
libbeat.publisher.published_events: 8144.14448334/s (avg: 7030.65813196/s) 
publish.events: 6112.44754178/s (avg: 6977.26856088/s)

The statistics we see in tsar are as follows:

user     sys    wait    hirq    sirq    util
12.1     0.5     0.1     0.0     0.0    12.9 

We can see in top that Filebeat occupies roughly 0.9 cores, below the 1-core limit.
Summary:

  • While the 6 processes are writing their log files, the writers themselves eat up almost all the CPU, and Filebeat's collection rate cannot keep up with the data generation rate: about 6 K/s at a CPU cost of 0.7-0.75 cores.
  • After the 60 million logs have been written, Filebeat pushes the collected data to Kafka at full speed, at roughly 7 K/s with a CPU cost of 0.9 cores.

Test conclusion

Situation                                                          Throughput (K/s)   CPU (cores)
Output to console, application write + FB collection, 247B               19                0.8
Output to console, FB collection only, 247B                              39                1.6
Output to Kafka, application write + FB collection, 247B                 16.5              0.75
Output to Kafka, FB collection only, 247B                                39.5              2.0
Output to Kafka, application write + FB collection, 247B, 1 core         16.5              0.75
Output to Kafka, FB collection only, 247B, 1 core                        18.5              1.0
Output to Kafka, application write + FB collection, 1KB, 1 core          13.5              0.83
Output to Kafka, FB collection only, 1KB, 1 core                         15.5              1.0
Output to Kafka, application write + FB collection, 5KB, 1 core          6                 0.73
Output to Kafka, FB collection only, 5KB, 1 core                         7                 0.9

A. FB collects 247B data at a rate of ~40K/s and CPU overhead of 2 cores.

B. Under a 1-core CPU limit, FB collects 247B data at ~20 K/s, i.e. roughly 20 K/s per core.

C. The larger each log line, the lower the throughput in lines per second. 5 KB per line is already an exaggerated case; even so, the uncompressed bandwidth is about 35 MB/s, and with a typical gzip compression ratio of 0.3-0.4 the occupied bandwidth is 10.5-14 MB/s, which is light pressure for a gigabit NIC. (A quick arithmetic check follows.)
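
The figures in point C follow from simple arithmetic; a quick check (the 0.3-0.4 ratios are the ones quoted above):

# bandwidth_check.py - sanity-check the bandwidth estimate from point C
rate = 7000             # events/s measured for the 5 KB case
line_bytes = 5 * 1024   # bytes per event

raw = rate * line_bytes / 1e6                     # raw throughput in MB/s
print("uncompressed: {:.1f} MB/s".format(raw))    # ~35.8 MB/s, the ~35 MB/s above

for ratio in (0.3, 0.4):                          # typical gzip ratios for text logs
    print("gzip x{:.1f}: {:.1f} MB/s".format(ratio, raw * ratio))  # ~10.7-14.3 MB/s
# a gigabit NIC moves ~125 MB/s, so even the raw stream fits comfortably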

In production, even with gzip compression, there must be throttling or degradation measures in place so that an abnormal surge of logs cannot saturate the network bandwidth.



Author: Heart of Big Data
Link: https://www.jianshu.com/p/0802cf22a135
Source: Jianshu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
