What’s new in Prometheus monitoring for Docker and Kubernetes

By Serdar Yegulalp Nov 9th 2017
Prometheus 2.0 container monitoring system arrives with a more efficient time-series data storage format, better handling of stale event data, and snapshot-based database backups

Prometheus, the open source monitoring system for Docker-style containers running in cloud architectures, has formally released a 2.0 version with major architectural changes to improve its performance.

Among the changes that have landed since the release of version 1.6 earlier this year:

  • An entirely new storage format for the data accumulated by Prometheus.
  • A new way for Prometheus to handle “staleness,” i.e. problems resulting when data reported by Prometheus doesn’t match the actual state of the cluster.
  • A method for taking efficient snapshot backups of the entire database.

Most of the changes shouldn’t force experienced Prometheus users to retool their environments. The new features are meant to work under the hood, without significantly altering workflow, although there are a few breaking changes (documented here).

New in Prometheus 2.0: More efficient time-series database storage format

Under the hood, Prometheus is a time-series database—a system for gathering statistics about running containers and storing them in a way that’s indexed by timestamps. Because time-series data arrives at high speed and from many sources, it’s hard to aggregate properly. Writing the data to disk becomes a major bottleneck.

Prometheus 2.0 addresses this by partitioning the data by ranges of time, rather than by data source. The result is far less CPU and disk usage, more manageable latency for queries, and a better mechanism for mopping up data that isn’t needed anymore.
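To make the idea concrete, here is a minimal sketch of time-range partitioning in Python. It is a toy in-memory model, not Prometheus's actual storage engine: the class name, method names, and the two-hour block width are assumptions chosen for illustration. Samples are grouped into blocks by timestamp, so a query touches only the blocks that overlap its time range, and expired data is removed by dropping whole blocks.

```python
from collections import defaultdict

# Assumed block width for this sketch: two hours, in milliseconds.
BLOCK_SPAN_MS = 2 * 60 * 60 * 1000


class TimePartitionedStore:
    """Toy store that groups samples by time range, not by data source."""

    def __init__(self, block_span_ms=BLOCK_SPAN_MS):
        self.block_span_ms = block_span_ms
        # block start time -> series name -> list of (timestamp, value)
        self.blocks = defaultdict(lambda: defaultdict(list))

    def _block_start(self, ts_ms):
        # Round the timestamp down to the start of its block.
        return ts_ms - (ts_ms % self.block_span_ms)

    def append(self, series, ts_ms, value):
        self.blocks[self._block_start(ts_ms)][series].append((ts_ms, value))

    def query(self, series, start_ms, end_ms):
        # Only blocks that overlap [start_ms, end_ms] are examined.
        first = self._block_start(start_ms)
        out = []
        for block_start in sorted(self.blocks):
            if block_start < first or block_start > end_ms:
                continue
            for ts, value in self.blocks[block_start].get(series, []):
                if start_ms <= ts <= end_ms:
                    out.append((ts, value))
        return sorted(out)

    def drop_older_than(self, cutoff_ms):
        # Retention: delete every block that ends at or before the cutoff.
        expired = [b for b in self.blocks if b + self.block_span_ms <= cutoff_ms]
        for b in expired:
            del self.blocks[b]
        return len(expired)
```

In this model, cleaning up old data never requires scanning individual series; an entire block is discarded at once, which is the kind of cheap cleanup the new format is meant to allow.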

Again, the vast majority of Prometheus deployments won’t need to do anything to leverage these improvements, other than deploy Prometheus 2.0.

New in Prometheus 2.0: Better handling of stale data from containers

Another problem Prometheus users have observed is that the system has trouble handling stale data. For instance, users sometimes get bombarded with alerts about a service being down, even after that service has already come back up. Likewise, if a resource disappears from monitoring and then reappears within a certain time frame, it can end up being counted twice and produce misleading statistics.

Prometheus 2.0 deals with this by having more explicit rules for handling events from sources that have gone stale. The logic for handling this is surprisingly complex (see this slide deck for details), but the end user doesn’t have to deal with the vast majority of the details.
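A heavily simplified sketch of one such rule, in Python, might look like the following. It is not Prometheus's implementation: the `STALE` sentinel, the `value_at` function, and the 300-second lookback window are all assumptions made for illustration. The idea is that when a source vanishes, an explicit marker is recorded, and any query evaluated after that marker returns nothing for the series instead of repeating its last known value.

```python
# Hypothetical sentinel standing in for an explicit staleness marker.
STALE = object()


def value_at(samples, at, lookback=300):
    """Return the value of a series at time `at`, or None if it is stale.

    samples  -- list of (timestamp, value) pairs sorted by timestamp
    lookback -- assumed maximum age, in seconds, of a usable sample
    """
    latest = None
    for ts, value in samples:
        if ts > at:
            break
        latest = (ts, value)

    if latest is None:
        return None  # no sample at or before `at`

    ts, value = latest
    if value is STALE:
        return None  # the source was explicitly marked as gone
    if at - ts > lookback:
        return None  # the newest sample is too old to trust
    return value
```

With samples `[(0, 1.0), (15, 1.0), (30, STALE)]`, a query at time 20 sees the value from time 15, while a query at time 40 sees nothing, because the marker at time 30 cuts the series off immediately rather than letting the old value linger.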

New in Prometheus 2.0: Full database snapshot backups

The new storage engine in Prometheus 2.0 makes it possible to take efficient point-in-time snapshots of the database. Triggering a snapshot is as simple as hitting a specific Prometheus API endpoint.

According to Prometheus developer Fabian Reinartz, those snapshots are small—a fractional percentage of the size of the whole database—and can be copied somewhere for safekeeping. “On disk failure or other scenarios, new Prometheus servers can be started with the snapshot backup with minimal data loss,” says Reinartz.
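As a rough sketch of what triggering a snapshot could look like from a script, the Python below sends a POST request to the snapshot endpoint using only the standard library. The path shown is the one documented for later 2.x releases, and the 2.0 release itself may expose a different path, so treat it as an assumption and check the documentation for your version. The admin API also has to be enabled on the server (the `--web.enable-admin-api` flag in current releases). The function names and the `http://localhost:9090` address are illustrative.

```python
import json
import urllib.request

# Assumed endpoint path; verify it against the docs for your Prometheus version.
SNAPSHOT_PATH = "/api/v1/admin/tsdb/snapshot"


def build_snapshot_request(base_url):
    """Build (but do not send) the POST request that asks for a snapshot."""
    url = base_url.rstrip("/") + SNAPSHOT_PATH
    return urllib.request.Request(url, data=b"", method="POST")


def parse_snapshot_name(body):
    """Extract the snapshot name from the JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def take_snapshot(base_url, timeout=30):
    """Trigger a snapshot on a running server and return its name."""
    request = build_snapshot_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_snapshot_name(response.read())


if __name__ == "__main__":
    # Requires a reachable Prometheus server with the admin API enabled.
    print(take_snapshot("http://localhost:9090"))
```

The returned name identifies a directory under the server's data directory, which can then be copied off the machine for safekeeping.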
