This article is reproduced for personal archiving only. See the original:
https://www.jianshu.com/p/0802cf22a135
Preface
Before any collection agent can be rolled out company-wide, it needs performance testing and resource-limit testing, so that we:
- know clearly where the performance bottleneck is;
- know clearly where the bottleneck is when data volume surges;
- know clearly whether the resource limits take effect when data volume surges.
Filebeat, a collection agent written in Golang, is said to perform better than logstash-forwarder, but we still need to evaluate it rigorously.
Test details
Hardware environment and version
- The hardware is a virtual machine: 6 cores + 16GB Mem + 175GB SSD + 1000Mbps bandwidth;
- Filebeat version 5.2.1, Kafka version 2.11-0.10.1.1;
- 172.16.134.3: kafka broker
- 172.16.134.8: Filebeat
Test plan
- Filebeat outputs to console; 6 processes write to the collected file; 247B per line;
- Filebeat outputs to Kafka; 6 processes write to the collected file; 247B per line; gzip compression;
- Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 247B per line; gzip compression;
- Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 1KB per line; gzip compression;
- Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 5KB per line; gzip compression.
Test configuration
Filebeat configuration, output to console:
filebeat.prospectors:
- input_type: log
  paths:
    - /home/admin/soft/Filebeat-5.2.1-linux-x86_64/logs/test
  document_type: Filebeat
output.console:
  pretty: true
Filebeat configuration, output to Kafka:
filebeat.prospectors:
- input_type: log
  paths:
    - /home/admin/soft/Filebeat-5.2.1-linux-x86_64/logs/test
  document_type: Filebeat
output.kafka:
  hosts: ["172.16.134.3:9092"]
  topic: '%{[type]}'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
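Note that in Filebeat 5.x the format string topic: '%{[type]}' resolves per event to the event's type field, i.e. the document_type set above, so all events in this test land in a single Kafka topic named Filebeat.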
Statistical method
We enable Filebeat's httpprof endpoint on port 6060 and use a Python script, expvar_rates.py, to collect metrics.
expvar_rates.py computes per-second rates of the Filebeat counters. We mainly look at:
- libbeat.publisher.published_events, avg
- publish.events, avg
- libbeat.kafka.published_and_acked_events, avg
For system metrics we mainly watch CPU usage, which can be observed with top or tsar.
import requests
import argparse
import time
import curses
import re

def main():
    parser = argparse.ArgumentParser(
        description="Print per second stats from expvars")
    parser.add_argument("--url", default="http://localhost:6060",
                        help="The URL from where to read the values")
    parser.add_argument("--filter", default=None,
                        help="Filter metrics by this search regexp")
    args = parser.parse_args()

    stdscr = curses.initscr()
    curses.noecho()
    curses.cbreak()

    # last absolute value of each counter, used to compute per-second deltas
    last_vals = {}
    # running average over the last 30 measurements
    N = 30
    avg_vals = {}
    now = time.time()

    try:
        while True:
            try:
                time.sleep(1.0)
                stdscr.erase()

                r = requests.get(args.url + "/debug/vars")
                data = r.json()

                last = now
                now = time.time()
                dt = now - last

                for key, total in data.items():
                    if args.filter is not None:
                        if re.search(args.filter, key) is None:
                            continue
                    if isinstance(total, (int, float)):
                        if key in last_vals:
                            # rate = counter delta over the elapsed interval
                            per_sec = (total - last_vals[key]) / dt
                            avg_vals.setdefault(key, []).append(per_sec)
                            if len(avg_vals[key]) > N:
                                avg_vals[key] = avg_vals[key][1:]
                            avg_sec = sum(avg_vals[key]) / len(avg_vals[key])
                        else:
                            per_sec = "na"
                            avg_sec = "na"
                        last_vals[key] = total
                        try:
                            stdscr.addstr("{}: {}/s (avg: {}/s) (total: {})\n"
                                          .format(key, per_sec, avg_sec, total))
                        except Exception as e:
                            raise Exception("curses.addstr fail. Resize the "
                                            "terminal window or use the "
                                            "--filter option: {}".format(e))
                stdscr.refresh()
            except requests.ConnectionError:
                stdscr.addstr("Waiting for connection...\n")
                stdscr.refresh()
                last_vals = {}
                avg_vals = {}
    finally:
        # restore the terminal (e.g. after Ctrl-C)
        curses.nocbreak()
        curses.echo()
        curses.endwin()

if __name__ == "__main__":
    main()
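For example, to watch only the Kafka-related rates from another machine, we can combine the script's --url and --filter options defined above:
python expvar_rates.py --url http://172.16.134.8:6060 --filter kafka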
Test preparation
We generate test data with a shell loop:
time for i in {1..10000000};do echo "${log_message}" >> /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test;done;
The execution time shows a log write rate of ~250K lines/s. From ls -l /proc/${pid}/fd | grep -v total | wc -l we can see that Filebeat holds 9 handles when collecting a single file and 10 when collecting two files, so the baseline handle count is 9, plus one handle per additional collected file.
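As a cross-check, here is a minimal Python sketch (Linux only, reading /proc; pass Filebeat's pid as the argument) that counts a process's open handles directly:

import os
import sys

def open_handle_count(pid):
    # Each entry under /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir("/proc/{}/fd".format(pid)))

if __name__ == "__main__":
    print(open_handle_count(int(sys.argv[1])))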
Case A: Filebeat outputs to console; 6 processes write to the collected file; 247B per line
Step1. Start Filebeat (172.16.134.8)
/home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat -c /home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat.yml -httpprof 0.0.0.0:6060 > /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/filebeat 2>&1 &
Step2. Start the statistics script
python expvar_rates.py
Step3. Start tsar
tsar --cpu --live -i 1
Step4. Write load-test data (6 writer processes, 10 million logs each, 60 million in total; run the loop below once per process)
for i in {1..10000000};do echo "2017-03-16T19:21:14+08:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=2 registrar.states.update=3 libbeat.kafka.call_count.PublishEvents=2 publish.events=3 libbeat.kafka.published_and_acked_events=2 registrar.writes=2" >> /home/admin/soft/filebeat-5.2.1-linux-x86_64/logs/test;done;
While the 6 processes are writing to the log file, the Python statistics window shows the following stable figures:
libbeat.publisher.published_events: 18014.370905/s (avg: 19641.6123106/s)
publish.events: 21920.7951289/s (avg: 19610.1652594/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
67.3 22.8 0.0 0.0 0.0 90.2
We can see in top that Filebeat accounts for roughly 0.8 cores.
After the 6 processes finish writing, the Python statistics window shows the following stable figures:
libbeat.publisher.published_events: 40691.6042823/s (avg: 39300.4436095/s)
publish.events: 38676.4998717/s (avg: 39370.3604289/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
24.6 2.0 0.2 0.0 0.0 26.9
We can see in top that Filebeat accounts for roughly 1.6 cores.
Summary:
- While the 6 processes are writing log files, the writers themselves eat up most of the CPU; Filebeat's collection rate (~19K/s) cannot keep up with the generation rate, at a CPU overhead of 0.8 cores.
- After the 60 million logs are fully written, Filebeat pushes data to the console at full speed, roughly 39K/s, at a CPU overhead of 1.6 cores.
Case B: Filebeat outputs to Kafka; 6 processes write to the collected file; 247B per line; gzip compression
The test steps are the same as above; the only difference is that the configuration outputs to Kafka.
While the 6 processes are writing to the log file, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 14125.0820764/s (avg: 16282.8981012/s)
libbeat.publisher.published_events: 17357.7280373/s (avg: 16419.7944927/s)
publish.events: 16251.4829085/s (avg: 16510.4781086/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
67.9 23.1 0.0 0.0 0.0 91.1
We can see in top that Filebeat occupies roughly 0.7-0.8 cores.
After the 6 processes finish writing, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 40614.7704231/s (avg: 39698.759785/s)
libbeat.publisher.published_events: 40722.3439941/s (avg: 39628.4561631/s)
publish.events: 38671.5053801/s (avg: 38930.2225183/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
29.2 0.9 0.0 0.0 0.0 30.5
We can see in top that Filebeat accounts for roughly 2.0 cores.
Summary:
- While the 6 processes are writing log files, the writers eat up most of the CPU; Filebeat's collection rate (~16-17K/s) cannot keep up with the generation rate, at a CPU overhead of 0.7-0.8 cores.
- After the 60 million logs are fully written, Filebeat pushes data to Kafka at full speed, roughly 39-40K/s, at a CPU overhead of 2.0 cores.
Case C: Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 247B per line; gzip compression
The test steps are the same as above, except that systemd is used to limit Filebeat's CPU and handle count at startup. Based on the handle analysis above, a limit of 100 handles is more than sufficient, and the CPU is limited to 1 core.
Modify /usr/lib/systemd/system/filebeat.service:
[Unit]
Description=Filebeat
Documentation=https://www.elastic.co/guide
After=network.target

[Service]
Type=simple
ExecStart=/home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat -c /home/admin/soft/filebeat-5.2.1-linux-x86_64/filebeat.yml -httpprof 0.0.0.0:6060
ExecStop=/bin/kill -WINCH ${MAINPID}
Restart=always
RestartSec=0
WatchdogSec=1min
LimitNOFILE=100
LimitNPROC=100

[Install]
WantedBy=multi-user.target
Execute reload:
sudo systemctl daemon-reload
Restrict CPU (CPUQuota=100% corresponds to one full core):
sudo systemctl set-property filebeat CPUQuota=100%
Confirm the limits are in effect:
sudo systemctl show filebeat | grep -E "CPUQuota|LimitNOFILE|LimitNPROC"
Output like the following indicates success (CPUQuotaPerSecUSec=1s is systemd's representation of CPUQuota=100%):
CPUQuotaPerSecUSec=1s
LimitNOFILE=100
LimitNPROC=100
DropInPaths=/etc/systemd/system/filebeat.service.d/50-CPUQuota.conf
While the 6 processes are writing to the log file, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 19844.7720225/s (avg: 16837.4773012/s)
libbeat.publisher.published_events: 18030.6556898/s (avg: 16764.6918644/s)
publish.events: 15644.7977205/s (avg: 16910.2166832/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
79.2 11.3 0.0 0.0 0.0 90.6
We can see in top that Filebeat accounts for roughly 0.7 to 0.8 cores.
After the 6 processes finish writing, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 18337.7079915/s (avg: 18923.3234652/s)
libbeat.publisher.published_events: 20376.3669159/s (avg: 18848.5970484/s)
publish.events: 20323.8510655/s (avg: 18914.4681101/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
14.4 0.5 0.0 0.0 0.0 15.2
We can see in top that Filebeat occupies roughly 1.0 core, so the restriction takes effect.
Summary:
- While the 6 processes are writing log files, the writers eat up most of the CPU; Filebeat's collection rate (~16-17K/s) cannot keep up with the generation rate, at a CPU overhead of 0.7-0.8 cores.
- After the 60 million logs are fully written, Filebeat pushes data to Kafka at full speed, roughly 18-19K/s, at a CPU overhead of 1.0 core.
Case D: Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 1KB per line; gzip compression
While the 6 processes are writing to the log file, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 11954.3900946/s (avg: 13256.2128845/s)
libbeat.publisher.published_events: 14084.2447431/s (avg: 13241.4999292/s)
publish.events: 11998.8163015/s (avg: 13205.6630096/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
85.4 5.0 0.1 0.0 0.0 90.5
We can see in top that Filebeat accounts for roughly 0.75 to 0.9 cores.
After the 6 processes finish writing, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 14257.5686603/s (avg: 15411.161373/s)
libbeat.publisher.published_events: 16297.7185003/s (avg: 15475.6613026/s)
publish.events: 16292.4519132/s (avg: 15545.8192101/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
14.5 0.6 0.0 0.0 0.0 15.3
We can see in top that Filebeat occupies roughly 1.0 core, so the restriction takes effect.
Summary:
- While the 6 processes are writing log files, the writers eat up most of the CPU; Filebeat's collection rate (~13-14K/s) cannot keep up with the generation rate, at a CPU overhead of 0.75-0.9 cores.
- After the 60 million logs are fully written, Filebeat pushes data to Kafka at full speed, roughly 15-16K/s, at a CPU overhead of 1.0 core.
Case E: Filebeat outputs to Kafka, limited to 1 CPU core; 6 processes write to the collected file; 5KB per line; gzip compression
While the 6 processes are writing to the log file, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 5853.48499116/s (avg: 6142.33474568/s)
libbeat.publisher.published_events: 6052.63549446/s (avg: 6066.04758752/s)
publish.events: 5979.76512269/s (avg: 6063.20532399/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
89.2 2.2 0.1 0.0 0.0 91.5
We can see in top that Filebeat accounts for roughly 0.7 to 0.75 cores.
After the 6 processes finish writing, the Python statistics window shows the following stable figures:
libbeat.kafka.published_and_acked_events: 6110.43003481/s (avg: 7030.51292764/s)
libbeat.publisher.published_events: 8144.14448334/s (avg: 7030.65813196/s)
publish.events: 6112.44754178/s (avg: 6977.26856088/s)
The statistics we see in tsar are as follows:
user sys wait hirq sirq util
12.1 0.5 0.1 0.0 0.0 12.9
We can see in top that Filebeat occupies roughly 0.9 cores, below the 1-core limit.
Summary:
- While the 6 processes are writing log files, the writers eat up most of the CPU; Filebeat's collection rate (~6K/s) cannot keep up with the generation rate, at a CPU overhead of 0.7-0.75 cores.
- After the 60 million logs are fully written, Filebeat pushes data to Kafka at full speed, roughly 7K/s, at a CPU overhead of 0.9 cores.
Test conclusion
Situation | Throughput (K/s) | CPU (cores)
---|---|---
Output to console, application writing + FB collecting, 247B | 19 | 0.8
Output to console, FB collecting only, 247B | 39 | 1.6
Output to Kafka, application writing + FB collecting, 247B | 16.5 | 0.75
Output to Kafka, FB collecting only, 247B | 39.5 | 2.0
Output to Kafka, application writing + FB collecting, 247B, 1 core | 16.5 | 0.75
Output to Kafka, FB collecting only, 247B, 1 core | 18.5 | 1.0
Output to Kafka, application writing + FB collecting, 1KB, 1 core | 13.5 | 0.83
Output to Kafka, FB collecting only, 1KB, 1 core | 15.5 | 1.0
Output to Kafka, application writing + FB collecting, 5KB, 1 core | 6 | 0.73
Output to Kafka, FB collecting only, 5KB, 1 core | 7 | 0.9
A. FB collects 247B-per-line data at ~40K/s with a CPU overhead of 2 cores.
B. Under a 1-core CPU limit, FB collects 247B data at ~20K/s, i.e. roughly 20K/s per core.
C. The larger each log line, the lower the throughput. 5KB per line is already extreme; even so, the uncompressed bandwidth consumption is about 35MB/s, and with a typical gzip compression ratio of 0.3-0.4 the occupied bandwidth is 10.5-14MB/s, which is little pressure for a Gigabit NIC.
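A quick back-of-the-envelope check of that estimate in Python, using only figures from the conclusion table (7K events/s, 5KB per line, gzip ratio 0.3-0.4):

# Bandwidth estimate for Case E: 5KB lines collected at ~7K/s.
rate_per_sec = 7000            # events/s, from the conclusion table
line_bytes = 5 * 1000          # ~5KB per line (decimal, as in the article)

raw_mb_per_s = rate_per_sec * line_bytes / 1e6
print("uncompressed: {:.1f} MB/s".format(raw_mb_per_s))        # 35.0 MB/s

for ratio in (0.3, 0.4):                                       # typical gzip ratios
    print("gzip ratio {}: {:.1f} MB/s".format(ratio, raw_mb_per_s * ratio))
# -> 10.5 and 14.0 MB/s, well under a Gigabit NIC's ~125 MB/s ceiling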
In production, even with gzip compression, there must be throttling or degradation measures in place, so that an abnormal log surge cannot saturate the network bandwidth.
Author: Heart of Big Data
Link: https://www.jianshu.com/p/0802cf22a135
Source: Jianshu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.