Fluent Bit Performance Test Tools
Fluent Bit is a fast and lightweight log processor. As part of our continuous development and testing model, we provide specific tools to test performance under different data load scenarios.
The objective of our performance tooling is to gather the following insights:
After ingesting N records, measure:
- CPU usage: user time and system time
- Memory usage
- Total time required to process the data
Tests aim to run for a fixed amount of time, and the data load can be increased per round. Every tool can monitor and collect resource usage metrics from the target process, whether that is Fluent Bit or another logging tool.
We focus on a specific set of metrics only, which is enough for the purpose of the testing.
How it Works
Every available tool is written on top of a generic framework that provides interfaces to load data files, gather metrics, and generate a report from a running process.
The following diagram uses the flb-tail-writer tool as an example. It writes N records (lines) to a custom log file while, in a separate session, Fluent Bit reads the generated file through the Tail input plugin. Internally, the Linux kernel exposes Fluent Bit process metrics through ProcFS; flb-tail-writer gathers those metrics before and after every write round and reports on resource consumption.
+-----------------+                 +----------------+
| Proc FS (/proc) +<----------------+  Linux Kernel  |
+-----+-----+-----+                 +-----+----+-----+
      |     ^                             |    ^
      v     |                             v    |
+-----+-----+-----+                 +-----+----+-----+
|                 |                 |                |
| FLB Tail Writer +-----+     +-----+   Fluent Bit   |
|                 |     |     |     |                |
+--------+--------+     |     |     +----------------+
         |              |     |
         v              v     v
  +------+------+     +-+-----+----------+
  | Test Report |     | /var/log/out.log |
  +-------------+     +------------------+
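To make the ProcFS step more concrete, here is a minimal sketch of how a monitor could read CPU times and resident memory for a PID. It is not the code used by our tools. It assumes the metrics come from the `/proc/<pid>/stat` file (fields 14, 15 and 24 as documented in proc(5)), and the function names `parse_proc_stat` and `read_proc_stat` are hypothetical.

```python
import os


def parse_proc_stat(stat_line, clock_ticks=100, page_size=4096):
    """Extract CPU times (ms) and RSS (bytes) from one /proc/<pid>/stat line.

    clock_ticks and page_size are assumptions here; on a live system they
    should come from sysconf (see read_proc_stat below).
    """
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so cut the line after the last ')' before splitting.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field n lives at rest[n - 3].
    utime_ticks = int(rest[11])  # field 14: time spent in user space
    stime_ticks = int(rest[12])  # field 15: time spent in kernel space
    rss_pages = int(rest[21])    # field 24: resident set size, in pages
    return {
        "user_ms": utime_ticks * 1000 // clock_ticks,
        "sys_ms": stime_ticks * 1000 // clock_ticks,
        "rss_bytes": rss_pages * page_size,
    }


def read_proc_stat(pid):
    """Read the metrics of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_proc_stat(
            stat_file.read(),
            clock_ticks=os.sysconf("SC_CLK_TCK"),
            page_size=os.sysconf("SC_PAGE_SIZE"),
        )
```

Calling `read_proc_stat(pid)` before and after a write round and subtracting the two results gives the CPU time consumed during that round.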
As an example, consider the following test using flb-tail-writer, which:
- Reads data samples from the data.log file.
- Writes output data to the out.log file.
- Writes 1000000 records (log lines) every second, 10 times (10 seconds in total).
- Monitors the resource usage of the Fluent Bit process ID (PID).
- Stops monitoring the Fluent Bit process once it has been almost idle for 3 seconds.
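The write rounds described above can be sketched roughly as follows. This is an illustration only, not how flb-tail-writer is implemented: the function `write_rounds` and its parameters are hypothetical, and it assumes the sample lines are reused in a cycle when the sample file has fewer lines than a round needs. Monitoring and idle detection are left out.

```python
import itertools
import time


def write_rounds(samples, out_path, records_per_round, rounds, interval=1.0):
    """Append records_per_round lines to out_path, once per round.

    Returns one (records, bytes_written, elapsed_secs) tuple per round.
    """
    source = itertools.cycle(samples)  # reuse samples when there are too few
    report = []
    with open(out_path, "ab") as out:
        for _ in range(rounds):
            start = time.monotonic()
            records = 0
            written = 0
            for line in itertools.islice(source, records_per_round):
                data = (line.rstrip("\n") + "\n").encode("utf-8")
                written += out.write(data)
                records += 1
            out.flush()  # make the round visible to the reader process
            elapsed = time.monotonic() - start
            report.append((records, written, elapsed))
            # Wait out the rest of the interval so rounds start about
            # `interval` seconds apart.
            time.sleep(max(0.0, interval - elapsed))
    return report
```

With the values from the example, a call would look like `write_rounds(open("data.log").readlines(), "out.log", 1000000, 10)`.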
Report Details
The report has two panes: the left side shows the data ingestion information provided by the perf tools, and the right side shows the metrics collected from the monitored process.
Left Pane
The left side of the report is a summary of the data samples sent to the target service.

| Column | Description |
|--------|-------------|
| records | Number of records ingested in the specific round. |
| write (b) | Total number of bytes written. |
| write | Human readable version of written bytes. |
| secs | Elapsed time writing the data. |
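As a small illustration of the difference between the write (b) and write columns, the sketch below turns a byte count into a human readable string. The unit labels, the 1024 base and the two-decimal format are assumptions made for this example; the actual report may format sizes differently.

```python
def human_size(num_bytes):
    """Format a byte count using 1024-based units (assumed format)."""
    units = ["b", "K", "M", "G", "T"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f}{unit}"
        value /= 1024
```

For example, `human_size(1536)` returns `"1.50K"`.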
Right Pane
Overall metrics from the monitored process when the -p PID parameter is used.

| Column | Description |
|--------|-------------|
| % cpu | CPU time used by the process in user and system space during the time (secs) that the performance tool was writing data. |
| user (ms) | CPU time spent in user time (user space), in milliseconds. |
| sys (ms) | CPU time spent in system time (kernel space), in milliseconds. |
| Memory | Number of bytes in memory (RSS) currently used by the process after writing the data and waiting for one second. |
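One plausible way to relate the % cpu column to the user (ms), sys (ms) and secs columns is shown below. The formula is our reading of the column descriptions, not a confirmed copy of what the tools compute, and the function name `cpu_percent` is hypothetical.

```python
def cpu_percent(user_ms_before, sys_ms_before, user_ms_after, sys_ms_after, secs):
    """CPU usage as a percentage of one core over a write round.

    Assumed formula: (user delta + sys delta) / elapsed wall time.
    """
    if secs <= 0:
        raise ValueError("secs must be positive")
    busy_ms = (user_ms_after - user_ms_before) + (sys_ms_after - sys_ms_before)
    return 100.0 * busy_ms / (secs * 1000.0)
```

For example, a process that used 400 ms of user time and 100 ms of system time during a 1 second round was busy 50% of the time: `cpu_percent(1000, 200, 1400, 300, 1.0)` returns `50.0`. With this formula a multi-threaded process can report more than 100%.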
Comments about Performance and Benchmarking
Performance is always critical, and when managing data at high scale there are many corners where it is possible to improve.
When measuring performance, it is important to understand the variables that can affect a running monitored service. Comparing a tool against itself, like Fluent Bit vs. Fluent Bit, is not a hard task. But if you aim to compare Fluent Bit against another solution in the same space, you have to do some extra work to make sure that the setup and conditions are the same, e.g. that buffer sizes are the same in both tools.
Running a performance test with default options on different services will lead to unreliable results.
Story: some years ago I was working on one of my HTTP server projects. We got into a virtual benchmark battle against a proprietary web server. Its authors claimed it was faster than all open source options available (e.g. Nginx, Lighttpd, Apache, etc.), and the benchmark results showed that their project was outstanding, leaving every other project behind.
After digging a bit more and measuring what that web server was doing at the operating system level, we discovered that it was caching every HTTP request and response without extra checks. If it got one million requests for the same endpoint in a Keep-Alive session, it sent the same response over and over, without the expected processing. Basically, it was prepared beforehand to cheat when benchmarked.
Upon sending an HTTP request with a URI whose query string variable changed every time (e.g. /?a=1..), it was slow as hell :)
At that moment I learned how important it is to measure every aspect of a running service. That’s why the simple metrics of CPU time in user/kernel space and memory usage are really important.
Final tip: if you are the user, try to do your own benchmarks for your own conditions and scenario. Trust our performance tooling, but don’t trust benchmark reports made by us (maintainers) or vendors XD.
Tools Available
Build Instructions
Requirements
Build
Run the following command to compile the tools:
License
This program is under the terms of the Apache License v2.0.
Authors