
v1.44.0

@netdatabot netdatabot released this 06 Dec 18:15
· 1122 commits to master since this release


True to our schedule, this is another great Netdata release!

Important

Stay informed about upcoming changes and potential deprecations by reviewing the deprecation notice sections. This will help you plan for any necessary adjustments to ensure a smooth transition.

Netdata Growth

  • 66k+ GitHub Stars ⭐
    Since October 2023, Netdata has been leading the observability category in the CNCF landscape, surpassing Elasticsearch. Thank you for your love ❤️! Give Netdata a ⭐ too, on GitHub!

  • 600M+ Docker Hub pulls
    Netdata runs at about 200k Docker Hub downloads per day. Since June 2023 we are a Verified Publisher, so Netdata pulls don't count against Docker Hub pull limits, allowing all our users to integrate Netdata into their CI/CD toolchains.

Release Summary

  • Netdata beats Prometheus in all aspects: this version of Netdata includes significant improvements that make Netdata a lot more performant than Prometheus at scale. A full performance analysis is included.
  • Netdata Journal Logs: Netdata can now handle huge systemd-journal databases, and host logs are available when Netdata runs in a container.
  • First beta version of Netdata's log2journal: a utility to extract, convert, transform and send to systemd-journal any kind of structured logs (including JSON and logfmt logs), similar to what promtail does for Loki.
  • More Netdata Functions: monitor containers and VMs, network interfaces, mount points, block devices, systemd units, systemd services, and more!
  • Netdata now logs to journal instead of log files and the results are amazing!

Release Highlights

Netdata beats Prometheus in all aspects

image

We tested Netdata and Prometheus at scale, both ingesting 2.7 million metrics per second. On the same workload, compared to Prometheus, Netdata needs:

  • 35% less CPU
  • 49% less RAM
  • 12% less bandwidth
  • 75% less disk space
  • 98% less disk I/O

Read the full performance comparison between Netdata and Prometheus.

To achieve these astonishing results, we made the following changes to Netdata since the previous release:

New SLOTS streaming protocol

A new streaming protocol allows Netdata children and parents to share a common index of the streamed metrics. Parents can then receive metrics without consulting hashtables, which reduces the overall overhead on parents by about 30% without increasing the overhead on children (the children just number each metric).

The new protocol, called SLOTS, is automatically selected when both the child and the parent support it.
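
The idea of a shared index can be illustrated with a minimal sketch. It is not Netdata's wire protocol: the class names, the define/ingest split and the metric names are hypothetical, chosen only to show why a slot number lets the parent skip name lookups on the hot path.

```python
# Hypothetical illustration of slot-based indexing; not Netdata's actual protocol.

class ParentIndex:
    """Parent-side state for the metrics streamed by one child."""

    def __init__(self):
        self.by_name = {}  # metric name -> list of samples (hashtable lookup)
        self.by_slot = []  # slot number -> the same list (plain array lookup)

    def define(self, slot, name):
        # Sent once per metric: the child announces the slot number it assigned.
        series = self.by_name.setdefault(name, [])
        while len(self.by_slot) <= slot:
            self.by_slot.append(None)
        self.by_slot[slot] = series

    def ingest(self, slot, value):
        # Hot path: an array index, no hashing of the metric name.
        self.by_slot[slot].append(value)


class Child:
    """Child-side sender that simply numbers each metric it streams."""

    def __init__(self, parent):
        self.parent = parent
        self.slots = {}  # a real agent could keep the number on the metric itself

    def send(self, name, value):
        slot = self.slots.get(name)
        if slot is None:
            slot = len(self.slots)
            self.slots[name] = slot
            self.parent.define(slot, name)
        self.parent.ingest(slot, value)
```

In this sketch the name-based lookup happens once per metric, when the slot is defined, and every later sample is routed by its number alone.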

Streaming compression algorithms

Streaming now supports multiple compression algorithms. Previous Netdata releases supported only LZ4, which is known for its speed and average compression ratio. This release adds support for ZSTD, GZIP, and BROTLI.

ZSTD provides the best balance between compression ratio and CPU consumption, and therefore it is now the default.

The selection order of the compression algorithms is configured on parents, in the [API] section of stream.conf, by setting compression algorithms order = zstd lz4 brotli gzip.

To save the most bandwidth at the expense of CPU utilization, place brotli or gzip first in the list, before zstd and lz4.

This also means that parents can now have a different compression order for each API key, allowing the use of different API keys based on the location of the child (e.g. children on billable egress bandwidth can use an API key that prefers the best compression, like brotli and gzip, while children on non-billable egress bandwidth can use an API key that prefers the best CPU utilization, like zstd or lz4).
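
A minimal sketch of how such an order could be applied per API key follows. It assumes the parent walks its configured order and picks the first algorithm the child also supports; the function name, the API key labels and the fallback to no compression are hypothetical and not taken from Netdata's code.

```python
# Hypothetical sketch of order-based selection; not Netdata's implementation.

def pick_compression(parent_order, child_supported):
    """Return the first algorithm in the parent's order that the child supports."""
    for algorithm in parent_order:
        if algorithm in child_supported:
            return algorithm
    return None  # assumption: no common algorithm means an uncompressed stream


# One order per API key (the key names are made up for illustration).
orders = {
    "metered-egress-key": ["brotli", "gzip", "zstd", "lz4"],  # favor bandwidth
    "lan-key": ["zstd", "lz4", "brotli", "gzip"],             # favor CPU
}

# A child that was built without brotli support.
child = {"zstd", "lz4", "gzip"}
on_metered_link = pick_compression(orders["metered-egress-key"], child)
on_lan = pick_compression(orders["lan-key"], child)
```

Under these assumptions the same child ends up with gzip on the metered key and zstd on the LAN key, which is the kind of per-location tuning described above.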

Gorilla compression beta

Gorilla compression is a time series data compression technique, developed by Facebook for their time series database, Gorilla. It's particularly efficient for compressing data that changes incrementally over time, which is a common characteristic of time series data.
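
To see why incrementally changing data compresses well, here is a small sketch of the XOR step from the general Gorilla scheme, applied to 64-bit floats. It is not Netdata's adaptation, and it stops before the bit-packing stage; the function names are hypothetical.

```python
import struct

# Illustrative only: the XOR step of the general Gorilla scheme, not Netdata's adaptation.

def float_to_bits(value):
    """Reinterpret a 64-bit float as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]

def bits_to_float(bits):
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def xor_encode(values):
    """First element: raw bits. Following elements: XOR with the previous value's bits."""
    encoded = []
    previous = 0
    for value in values:
        bits = float_to_bits(value)
        encoded.append(bits ^ previous)
        previous = bits
    return encoded

def xor_decode(encoded):
    values = []
    bits = 0
    for delta in encoded:
        bits ^= delta
        values.append(bits_to_float(bits))
    return values

def meaningful_bits(delta):
    """Width of the span between the highest and lowest set bit (0 for an unchanged value)."""
    if delta == 0:
        return 0
    trailing_zeros = (delta & -delta).bit_length() - 1
    return delta.bit_length() - trailing_zeros
```

An unchanged sample XORs to zero and a small change leaves only a few meaningful bits, so a real encoder can store each sample in far fewer than 64 bits.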

This release of Netdata includes an adaptation of Gorilla compression which, once enabled, reduces Netdata's memory usage by an additional 30%.

This was not ready when we compared Netdata and Prometheus, so the benefits of Gorilla compression are not reflected in the comparison. With Gorilla compression enabled, Netdata uses 70%+ less memory than Prometheus.

To try Gorilla compression, edit netdata.conf and, in the [db] section, set dbengine page type = gorilla.

Keep in mind that enabling Gorilla compression changes the dbengine file format to Gorilla-compressed metrics. This version of Netdata can read Gorilla-compressed data from dbengine even if Gorilla compression is not enabled, but previous versions of Netdata cannot read it. Enable Gorilla only if you don't plan to switch back to a previous version of Netdata.

Our plan is to have Gorilla compression enabled by default in the next release of Netdata.

systemd-journal logs

Our systemd-journal.plugin was already about 10x faster than journalctl, but it was still slow when the journal database is huge (e.g. at journal centralization points where hundreds or thousands of nodes push their logs).

In this release, we introduce several changes to allow the plugin to work promptly in such environments.

Sampling and estimations

The biggest performance issue with systemd-journal logs is the query performance when dealing with huge logs databases.

To overcome this performance issue and provide prompt responses to queries, Netdata now uses the following strategy:

  1. The latest 500k log entries read from journal files work like before: we read all of them and all the values for all their fields, so that we can have accurate histograms and counters per field value at the filters.
  2. Once we hit the 500k log entries limit on a single query, we turn on sampling and estimations.
  3. Sampling distributes 500k more log entries to all the journal files to be read, so that the total log entries queried for their field values will be 1M. This means that if we have to read 100 files, 10k log entries per file will be sampled and 10k log entries more will be unsampled. Since files are usually spread over time, this provides a good sample across time.
  4. When the sampling threshold is hit, Netdata continues reading more log entries without querying the values of the fields. These log entries appear as [unsampled] at the histogram. We know these log entries are there, but the value counters on the field filters do not include them.
  5. When the [unsampled] threshold is hit, and we have read more than 1% of each file, Netdata estimates the number of entries that will be read from the file and skips the rest of it. This estimation appears as [estimated] in the histogram.
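
The five steps above can be sketched for a single journal file as follows. This is a simplification under stated assumptions: the function name and all parameters are hypothetical, the per-file budgets are passed in rather than derived, and the file's total entry count is known in advance, whereas the plugin has to estimate it.

```python
# Hypothetical sketch of the sampling strategy for one file; not the plugin's code.

def plan_file_read(entries_in_file, full_left, sampled_per_file, unsampled_per_file):
    """Split a file's entries into fully read, sampled, [unsampled] and [estimated]."""
    remaining = entries_in_file

    # Steps 1-2: entries still covered by the query-wide "read everything" budget.
    full = min(remaining, full_left)
    remaining -= full

    # Step 3: entries whose field values are still queried, within this file's share.
    sampled = min(remaining, sampled_per_file)
    remaining -= sampled

    # Step 4: entries that are read and counted, but whose fields are not queried.
    unsampled = min(remaining, unsampled_per_file)
    remaining -= unsampled

    # Step 5: keep reading until at least 1% of the file has been read,
    # then estimate whatever is left instead of reading it.
    minimum_read = -(-entries_in_file // 100)  # ceiling of 1%
    already_read = full + sampled + unsampled
    if remaining and already_read < minimum_read:
        extra = min(remaining, minimum_read - already_read)
        unsampled += extra
        remaining -= extra

    return {"full": full, "sampled": sampled, "unsampled": unsampled, "estimated": remaining}
```

For example, calling plan_file_read(1_000_000, 0, 10_000, 10_000) reads 20k entries (2% of the file) and marks the other 980k as estimated.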

The above process allows Netdata to provide a histogram of the logs in a timely manner, even when the number of log entries in the visible timeframe is several dozen million.

A similar process is usually used by log management systems, including Grafana Loki and Elasticsearch. However, Netdata takes a much bigger sample of the data (other systems usually sample only a few thousand log entries, while Netdata usually samples more than a million) and the visualization allows exposing the exact sampling and estimations made at the histogram.

Image showing [unsampled] and [estimated] on a systemd journal system that collects about 10k nginx log entries per second:
image

Read more about journals query performance.

journals scan

On busy logs centralization servers, the number of journal files available in /var/log/journal/remote can grow significantly, slowing down directory listing (even ls -l is very slow on them).

To overcome this issue, Netdata now uses inotify events and sorts the files to be scanned from the latest to the oldest.

These changes allow Netdata to present the logs user interface for the most recent journals, immediately after a Netdata restart, while the journals database is scanned in the background.
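
The "latest first" ordering can be sketched as below. This is only an illustration: it assumes modification time is a good proxy for recency, filters on the usual .journal and .journal~ suffixes, and leaves out the inotify part entirely. The function name is hypothetical.

```python
import os

# Illustrative sketch: order journal files so the most recent ones are scanned first.
# Assumption: modification time stands in for "latest"; the plugin may use other criteria.

def scan_order(directory):
    """Return the journal file paths in a directory, most recently modified first."""
    candidates = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith((".journal", ".journal~")):
                candidates.append((entry.stat().st_mtime, entry.path))
    candidates.sort(reverse=True)
    return [path for _, path in candidates]
```

Scanning in this order means the files most likely to be queried right after a restart are indexed first, while older files can be processed in the background.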

Logs UI is now available when using Netdata docker images

We switched Netdata docker images from Alpine Linux to Debian, so that libsystemd will be available inside the docker image, allowing systemd-journal.plugin to be compiled and shipped with Netdata docker images.

Using Netdata docker images, Netdata can now query the host system journal files, while running inside the container.

MESSAGE_ID support

systemd-journal has a nice feature where certain events of common interest are given a specific MESSAGE_ID. Several such MESSAGE_IDs have been assigned to track common events, like coredumps, unit start/stop events, VM start/stop events, time changes, etc. In total, we found more than 50 unique events that are tracked this way.

This version of systemd-journal.plugin automatically tracks and annotates these MESSAGE_IDs using their names, allowing quick spotting of events of common interest.

This feature is available in the MESSAGE_ID field filter, on the right side of the dashboard.

log2journal, a new tool in your quiver for managing logs

log2journal is a new utility allowing the conversion of log files into structured systemd-journal log entries. This is currently in beta.

The utility allows processing logs like this:

tail -F /var/log/nginx/access.log |\
   log2journal -c nginx-combined |\
   systemd-cat-native

The above builds a basic pipeline for converting the access.log of an Nginx web server into structured log entries in the local systemd-journal.

  • tail is responsible for feeding the latest log lines to log2journal. Multiple files can be specified, and log2journal can also pick up the filename from tail and add it as a field to the journal logs.
  • log2journal extracts fields from the log lines it is fed. This is a powerful tool that can read JSON and logfmt logs, and can also extract fields from any log using PCRE2 patterns. It supports filtering, renaming, and rewriting rules using command line arguments or YAML configuration files. The output of log2journal is the standard Journal Export Format.
  • systemd-cat-native is another new Netdata utility, reading standard Journal Export Format entries, which are then sent to a local or remote systemd-journal system.
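
For readers unfamiliar with the Journal Export Format that connects the last two stages of this pipeline, here is a minimal serializer sketch written from the systemd format description. It is not code from log2journal or systemd-cat-native, and the function name is hypothetical.

```python
import struct

# Minimal sketch of the systemd Journal Export Format; not taken from Netdata's tools.

def to_journal_export(fields):
    """Serialize one log entry (a dict of field name -> str or bytes) to export format."""
    out = bytearray()
    for name, value in fields.items():
        key = name.encode("ascii")
        data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
        if b"\n" in data:
            # Values containing newlines use the binary-safe form:
            # NAME, newline, 64-bit little-endian length, raw data, newline.
            out += key + b"\n" + struct.pack("<Q", len(data)) + data + b"\n"
        else:
            # Plain values use the text form: NAME=value followed by a newline.
            out += key + b"=" + data + b"\n"
    out += b"\n"  # an empty line terminates the entry
    return bytes(out)
```

A single-line entry such as {"MESSAGE": "GET /index.html", "PRIORITY": "6"} serializes to two NAME=value lines followed by an empty line, which is the kind of stream systemd-cat-native reads from its input.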

Read more here.

Image showing structured nginx logs in systemd-journal:
image

Netdata now logs to systemd-journal

The logging layer of Netdata has been rewritten, so that Netdata logs now go to the systemd-journal, in a namespace called netdata.

The obvious outcome is that you can now monitor Netdata logs using the systemd-journal.plugin user interface and, thanks to journal namespaces, this does not pollute the system logs. But this is just the beginning...

Netdata utilizes the MESSAGE_ID feature of systemd-journal to register:

  • all alert transitions
  • all alert notifications
  • all connections from Netdata children
  • all connections to Netdata parents

This means that both the systemd-journal.plugin user interface and journalctl can now be used to list all such events uniformly.

Screenshot of Netdata alert transitions in systemd-journals:
image

All Netdata logs are now structured. Netdata can also log in json or logfmt formats. We introduced a lot of new fields to track every aspect of Netdata, in a uniform and consistent way. Read more here.

Furthermore, we introduced a new tool called systemd-cat-native allowing any application or shell script to send structured logs to systemd-journal. Read more here.

Functions, power up your troubleshooting toolkit!

Several new Functions have been added to help us in our troubleshooting journeys. On top of processes, streaming and systemd-journal, we now leverage the wide range of collectors and metrics Netdata has to present their data in a different visual representation.

The updated list can be found in our documentation here. Below is a summary of the currently available Functions, together with the CLI tools each one is an alternative to:

Function | Description | Alternative to CLI tools | plugin - module
block-devices | Disk I/O activity for all block devices, offering insights into both data transfer volume and operation performance. | iostat | proc
containers-vms | Insights into the resource utilization of containers and QEMU virtual machines: CPU usage, memory consumption, disk I/O, and network traffic. | docker stats, systemd-cgtop | cgroups
ipmi-sensors | Readings and status of IPMI sensors. | ipmi-sensors | freeipmi
mount-points | Disk usage for each mount point, including used and available space, both in terms of percentage and actual bytes, as well as used and available inode counts. | df | diskspace
network interfaces | Network traffic, packet drop rates, interface states, MTU, speed, and duplex mode for all network interfaces. | bmon, bwm-ng | proc
processes | Real-time information about the system's resource usage, including CPU utilization, memory consumption, and disk IO for every running process. | top, htop | apps
systemd-journal | Viewing, exploring and analyzing systemd journal logs. | journalctl | systemd-journal
systemd-list-units | Information about all systemd units, including their active state, description, whether or not they are enabled, and more. | systemctl list-units | systemd-journal
systemd-services | System resource utilization for all running systemd services: CPU, memory, and disk IO. | systemd-cgtop | cgroups
streaming | Comprehensive overview of all Netdata children instances, offering detailed information about their status, replication completion time, and many more. | |

In the short term, we will keep adding more (hopefully) helpful Functions. In the longer term, we plan to expand this functionality to potentially allow taking and storing snapshots of the results, either when alerts are triggered or on a periodic schedule.

If you have suggestions, we have a GitHub Discussion open here.

New Alert Notification Integrations to Netdata Cloud

We've been working on adding more Alert Notification Integrations to Netdata Cloud and recently added the following new ones:

  • Amazon Simple Notification Service (Amazon SNS), and
  • Telegram

image

The full list of Alert Notification Integrations from Netdata Cloud can be found in our documentation here.

Acknowledgments

  • @ClaraCrazy for improving degraded adapters detection in python.d/megacli.
  • @thomasbeaudry for adding UPS selftest and status metrics to charts.d/apcupsd.
  • @watsonbox for adding LBAs written/read metrics to python.d/smartd_log.
  • @sepek for correcting an error in the "Change how long Netdata stores metrics" guide.
  • @seniorquico for fixing parsing and adding MAINT status metrics to python.d/haproxy.
  • @luisj1983 for correcting errors in the Health API documentation.
  • @andyundso for improving apps plugin by adding Erlang in apps_groups.conf.
  • @vobruba-martin for adding various improvements to go.d/mysql.

Contributions

Collectors

Improvements
  • Add more cases for megacli adapter degraded state (python.d/megacli) (#16522, @ClaraCrazy)
  • Improve estimations accuracy (systemd-journal.plugin) (#16467, @ktsaou)
  • Implement estimations (systemd-journal.plugin) (#16445, @ktsaou)
  • Improve startup time (systemd-journal.plugin) (#16443, @ktsaou)
  • Implement sampling (systemd-journal.plugin) (#16433, @ktsaou)
  • Add cgroup current pids metric (cgroups.plugin) (#16369, @ilyam8)
  • Add Ipmi-sensors function (freeipmi.plugin) (#16363, @ilyam8)
  • Add UPS status code metric (charts.d/apcupsd) (#16361, @thomasbeaudry)
  • Add Mount-points function (diskspace.plugin) (#16345, @ilyam8)
  • Add Block-devices function (proc/diskstats) (#16338, @ilyam8)
  • Add UsedBy field to Network-interfaces function (proc/proc_net_dev) (#16337, @ilyam8)
  • Add various improvements to Network-interfaces function (proc/proc_net_dev) (#16336, @ilyam8)
  • Add Network-interfaces function (proc/proc_net_dev) (#16334, @ilyam8)
  • Add Systemd-list-units function (systemd-journal.plugin) (#16318, @ktsaou)
  • Add Containers-vms function (cgroups.plugin) (#16314, @ktsaou)
  • Add UPS selftest status metric (charts.d/apcupsd) (#16286, @thomasbeaudry)
  • Add a configuration option to set private cleanup timeout (statsd.plugin) (#16269, @MrZammler)
  • Add container_device label to network interfaces (cgroups.plugin) (#16261, @ilyam8)
  • Add selecting multiple sources support (systemd-journal.plugin) (#16252, @ktsaou)
  • Add total LBAs written/read metrics (python.d/smartd_log) (#16245, @watsonbox)
  • Add Erlang to apps_groups.conf (apps.plugin) (#16231, @andyundso)
  • Add support for Proxmox vms/containers name resolution in Docker (cgroups.plugin) (#16193, @ilyam8)
  • Add nested JSON support to log parser (go.d/weblog) (#1416, @ilyam8)

Bug Fixes

  • Fix configuration loading (charts.d.plugin) (#16471, @ilyam8)
  • Fix an issue where systemd-journal would stop trying different socket paths after the first failure (systemd-journal.plugin) (#16458, @ktsaou)
  • Fix parsing PD without NCQ status (python.d/adaptec_raid) (#16400, @ilyam8)
  • Fix Systemd-list-units function expiration time (#16393, @ilyam8)
  • Fix lack of system.net when running inside LXC (#16364, @ilyam8)
  • Fix memory leak in Systemd-list-units function (systemd-journal.plugin) (#16333, @ktsaou)
  • Fix server status parsing and add MAINT status chart (python.d/haproxy) (#16253, @seniorquico)

Other

  • Skip timestamp when logging to journald (python.d.plugin) (#16516, @ilyam8)
  • Mute stock jobs logging during check() (python.d.plugin) (#16515, @ilyam8)
  • Improve performance of the plugin (systemd-journal.plugin) (#16509, @ktsaou)
  • Don't create runtime disk config by default (proc/diskspace, proc/diskstats) (#16503, @ilyam8)
  • Don't create runtime device config by default (proc/proc_net_dev) (#16501, @ilyam8)
  • Disable netdata monitoring section by default (#16480, @MrZammler)
  • Change apps oom and net charts order (ebpf.plugin) (#16395, @thiagoftsm)
  • Fix "differ in signedness" warn in cgroups plugin (#16391, @ilyam8)
  • Fix throttle_duration chart context (cgroups.plugin) (#16367, @ilyam8)
  • Hide summary columns in network and block devices functions (proc/diskstats, proc/proc_net_dev) (#16347, @ktsaou)
  • Fix crash when a container has no CPU/mem metrics in Containers-vms function (cgroups.plugin) (#16331, @ilyam8)
  • Add tcp v6 connect calls to Ebpf_socket function (ebpf.plugin) (#16316, @thiagoftsm)
  • Update journal sources once per minute (systemd-journal.plugin) (#16298, @ktsaou)
  • Minor updates and cleanup (systemd-journal.plugin) (#16267, @ktsaou)
  • Stop using deprecated distutils module (python.d.plugin) (#16259, @MrZammler)
  • Remove charts.d/nut (#16230, @ilyam8)
  • Don't log an error opening cgroup.procs/tasks if it does not exist (cgroups.plugin) (#16196, @ilyam8)
  • Improve exposing metrics by creating a chart for each app group (ebpf.plugin) (#16139, @thiagoftsm)
  • Skip timestamp when logging to journald (go.d.plugin) (#1418, @ilyam8)
  • Replace logger with structured logger (go.d.plugin) (#1418, @ilyam8)
  • Use SHOW REPLICA STATUS for MySQL v8.0.22+ (go.d/mysql) (#1392, @vobruba-martin)
  • Use performance_schema instead of information_schema for MySQL v8.0.22+ (go.d/mysql) (#1390, @vobruba-martin)

Packaging/Installation

All changes

Documentation

All changes

Other Notable Changes

Improvements
Bug Fixes
Other

Deprecation notice

Changed in this release

In accordance with our previous deprecation notice, the following items in this release have been changed:

Other unannounced changes:

  • Netdata internal metrics (Netdata Monitoring section) are disabled by default to reduce the overall data volume. Later we plan to enable only important internal metrics by default.

    They can be re-enabled in netdata.conf by uncommenting the following lines and changing no to yes:

    [plugins]
      # netdata monitoring = no
      # netdata monitoring extended = no
  • Logging

    • Logs format changed to logfmt.
    • Default logging destination changed to systemd-journal (systemd-only): logs are now sent to the "netdata" namespace in systemd-journal. Systemd-journal provides a centralized repository for all system logs, making it easier to manage and search for logs. To override the default behavior and continue using the file-based logging, refer to the netdata.conf file and make the necessary changes under the [logs] section.
    • File-based logging: error.log renamed to daemon.log.

Will be changed in the next release

  • To ensure seamless compatibility with future updates, we recommend transitioning from source-built installations to our distribution packages or static binaries. Starting with our next release, we will no longer guarantee compatibility when updating source-built installations. This change allows us to focus on enhancing the stability and feature delivery for the rest of our supported installation methods.

  • Gorilla compression will be enabled by default.

  • The Google Cloud Pub/Sub and the AWS Kinesis exporters will be removed in the next release. Neither of them was maintained or used when building packages. Users can consult the exporting documentation for alternative exporters to use.

  • The database modes map and save will be removed in the next release. The dbengine database mode will be used to persist metrics on disk automatically.

  • Per-core CPU metrics will be disabled by default to reduce data volume. Summary (per-system) metrics are still collected. This change enhances performance and resource utilization. Disabled metrics:

    • cpu.cpu (utilization).
    • cpu.interrupts (all interrupts).
    • cpu.softirqs (software interrupts).
    • cpu.softnet_stat (software interrupts related to network receive work).
    • cpu.cpu_cstate_residency_time (idle states).

    They can be enabled in netdata.conf by uncommenting the following lines and changing no to yes:

    [plugin:proc:/proc/stat]
        # per cpu core utilization = no
        # cpu idle states = no
    
    [plugin:proc:/proc/interrupts]
        # interrupts per core = no
    
    [plugin:proc:/proc/softirqs]
        # interrupts per core = no
    
    [plugin:proc:/proc/net/softnet_stat]
        # softnet_stat per core = no
  • To optimize system performance, several eBPF.plugin modules have been disabled by default. While these modules provide valuable insights into system resource usage, they can also contribute to system overhead. They will expose metrics using Functions (run on demand and for a limited period of time). These modules include:

    • cachestat
    • fd
    • process
    • oomkill
    • shm
    • swap

Netdata Release Meetup

Join the Netdata team on the 11th of December at 16:30 UTC for the Netdata Release Meetup.

Together we’ll cover:

  • Release Highlights.
  • Acknowledgments.
  • Q&A with the community.

RSVP now - we look forward to meeting you.

Support options

As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter an issue with any of the changes made in this release or any feature in the Netdata Agent, feel free to contact us through one of the following channels:

  • Netdata Learn: Find documentation, guides, and reference material for monitoring and troubleshooting your systems with Netdata.
  • GitHub Issues: Make use of the Netdata repository to report bugs or open a new feature request.
  • GitHub Discussions: Join the conversation around the Netdata development process and be a part of it.
  • Community Forums: Visit the Community Forums and contribute to the collaborative knowledge base.
  • Discord Server: Jump into the Netdata Discord and hang out with like-minded sysadmins, DevOps, SREs, and other troubleshooters. More than 1800 engineers are already using it!