Written by Thibault Mangé (https://github.com/thibaultmg) and published on Sep 16, 2024
Thanos is a sophisticated distributed system with a broad range of capabilities, and with that comes a certain level of configuration complexity. In this series of articles, we will take a deep dive into the lifecycle of a sample within Thanos, tracking its journey from initial ingestion to final retrieval. Our focus will be to explain Thanos’s critical internal mechanisms and highlight the essential configurations for each component, guiding you toward achieving your desired operational results. We will be covering the following Thanos components:
The objective of this series of articles is to make Thanos more accessible to new users, helping alleviate any initial apprehensions. We will also assume that the working environment is Kubernetes. Given the extensive ground to cover, our goal is to remain concise throughout this exploration.
Before diving deeper, please check the annexes to clarify some essential terminology. If you are already familiar with these concepts, feel free to skip ahead.
The sample usually originates from a Prometheus instance that is scraping targets in a cluster. There are two possible scenarios:

- Sidecar mode: a Thanos Sidecar runs alongside each Prometheus server, uploading its TSDB blocks to object storage and exposing its data to the rest of the Thanos system.
- Receive mode: Prometheus servers remote-write their samples to a Thanos Receive component, which ingests them and uploads blocks to object storage.
Also, bear in mind that if adding Thanos for collecting your cluster’s metrics removes the need for a full-fledged local Prometheus (with querying and alerting), you can save some resources by using the Prometheus Agent mode. In this configuration, Prometheus only scrapes the targets and forwards the data to the Thanos system.
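As a rough illustration, here is what a minimal Agent-mode setup could look like. This is a sketch only: the `--enable-feature=agent` flag follows Prometheus v2.32+ conventions, and the remote-write URL is a placeholder for your own Thanos Receive endpoint.

```yaml
# prometheus.yml for an agent-mode Prometheus: scrape and forward only,
# no local querying, alerting or long-term storage.
# Start Prometheus with: --enable-feature=agent --config.file=/etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod

remote_write:
  # Placeholder URL: point this at your Thanos Receive remote-write endpoint.
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
```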
The following diagram illustrates the two scenarios:
Comparing the two deployment modes, the Sidecar Mode is generally preferable due to its simpler configuration and fewer moving parts. However, if this isn’t possible, opt for the Receive Mode. Bear in mind, this mode requires careful configuration to ensure high availability, scalability, and durability. It adds another layer of indirection and comes with the overhead of operating the additional component.
Let’s start with our first Thanos component, the Receive or Receiver, the entry point to the system. It was introduced with this proposal. This component facilitates the ingestion of metrics from multiple clients, eliminating the need for close integration with the clients’ Prometheus deployments.
Thanos Receive exposes a remote-write endpoint (see Prometheus remote-write) that Prometheus servers can use to transmit metrics. The only prerequisite on the client side is to configure the remote write endpoint on each Prometheus server, a feature natively supported by Prometheus.
On the Receive component, the remote-write endpoint is configured with the `--remote-write.address` flag. You can also configure TLS options using the other `--remote-write.*` flags. You can see the full list of the Receiver flags here.
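For illustration, a minimal sketch of the corresponding container arguments in a Kubernetes manifest might look like the following. The port, paths and the extra storage flags are part of the sketch, not prescriptions.

```yaml
# Thanos Receive container (sketch): expose the remote-write endpoint.
args:
  - receive
  - --remote-write.address=0.0.0.0:19291              # endpoint Prometheus remote-writes to
  - --tsdb.path=/var/thanos/receive                    # local TSDB for ingested samples
  - --objstore.config-file=/etc/thanos/objstore.yml    # where completed blocks are uploaded
  # TLS for the remote-write server can be configured via the other --remote-write.* flags.
```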
The remote-write protocol is based on HTTP POST requests. The payload is a protobuf message containing a list of time series with their labels and samples. Generally, a payload contains at most one sample per time series but spans numerous time series. Metrics are typically scraped every 15 seconds, and the remote write is delayed by at most about 5 seconds, keeping the latency between scraping and query availability on the receiver low.
The Prometheus remote write configuration offers various parameters to tailor the connection specifications, parallelism, and payload properties (compression, batch size, etc.). While these may seem like implementation details for Prometheus, understanding them is essential for optimizing ingestion, as they form a sensitive part of the system.
From an implementation standpoint, the key idea is to read directly from the TSDB WAL (Write-Ahead Log), a simple mechanism commonly used by databases to ensure data durability. If you wish to delve deeper into the TSDB WAL, check out this great article. Once samples are extracted from the WAL, they are aggregated into parallel queues (shards) as remote-write payloads. When a queue reaches its limit or a maximum timeout is reached, the remote-write client stops reading the WAL and dispatches the data, and the cycle repeats. The parallelism is defined by the number of shards, whose count is dynamically optimized. More insights on Prometheus’s remote write can be found in the documentation. You can also find troubleshooting tips on Grafana’s blog.
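To make these knobs concrete, here is a hedged sketch of a Prometheus `remote_write` section with an explicit `queue_config`. The field names are standard Prometheus remote-write settings, the URL is a placeholder, and the values simply mirror the recommendations discussed below; adapt them to your own traffic.

```yaml
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive  # placeholder URL
    queue_config:
      batch_send_deadline: 5s     # max wait before a partially filled batch is sent
      min_backoff: 250ms          # initial retry delay on failed sends
      max_backoff: 10s            # cap on the exponential retry delay
      max_samples_per_send: 2000  # payload size per request
      capacity: 10000             # samples buffered per shard (queue)
      max_shards: 50              # upper bound on parallel queues
```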
The following diagram illustrates the impacts of each parameter on the remote write protocol:
Key points to consider:

- `batch_send_deadline` should be set to around 5s to minimize latency. This timeframe strikes a balance between minimizing latency and avoiding excessive request frequency that could burden the Receiver. While a 5-second delay might seem substantial in critical alert scenarios, it is generally acceptable considering the typical resolution time for most issues.
- `min_backoff` should ideally be no less than 250 milliseconds, and `max_backoff` should be at least 10 seconds. These settings help prevent Receiver overload, particularly in situations like system restarts, by controlling the rate of data sending.

In scenarios where you have limited control over client configurations, it becomes essential to shield the Receive component from potential misuse or overload. The Receive component includes several configuration options designed for this purpose, comprehensively detailed in the official documentation. Below is a diagram illustrating the impact of these configuration settings:
When implementing a topology with separate router and ingestor roles (as we will see later), these limits should be enforced at the router level.
Key points to consider:

- `series_limit` and `samples_limit` tend to be functionally equivalent. However, in scenarios where the remote writer is recovering from downtime, `samples_limit` may become more restrictive, as the payload might include multiple samples for the same series.

Considering these aspects, it is important to carefully monitor and adjust these limits. While they serve as necessary safeguards, overly restrictive settings can inadvertently lead to data loss.
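As an illustration, a write-limits file passed to the Receive via the `--receive.limits-config` flag (or its file-based variant) could look roughly like the sketch below. The field names follow the Thanos documentation at the time of writing; verify them against the version you run, and treat the values and tenant name as placeholders.

```yaml
# Sketch of a Receive write-limits configuration.
write:
  global:
    max_concurrency: 30          # simultaneous remote-write requests handled
  default:                       # applied to any tenant without an explicit override
    request:
      size_bytes_limit: 1572864  # reject requests larger than ~1.5MiB
      series_limit: 5000         # max series per request
      samples_limit: 10000       # max samples per request
  tenants:
    team-a:                      # per-tenant override (tenant name is illustrative)
      request:
        series_limit: 20000
```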
Relying on a single instance of Thanos Receive is not sufficient for two main reasons:

- Availability: a single instance is a single point of failure; any crash or rollout interrupts ingestion.
- Scalability: a single instance can only absorb so much remote-write traffic; the load must be spread across several instances.
To achieve high availability, it is necessary to deploy multiple Receive replicas. However, it is not just about having more instances; it is crucial to maintain consistency in sample ingestion. In other words, samples from a given time series should always be ingested by the same Receive instance. This keeps operations optimized: when it is not achieved, it imposes a higher load on other operations such as compacting or querying the data.
To that effect, you guessed it, the Receive component uses a hashring! With the hashring, every Receive participant knows and agrees on who must ingest which sample. When clients send data, they connect to any Receive instance, which then routes the data to the correct instances based on the hashring. This is why the Receive component is also known as the IngestorRouter.
All Receive instances must share a consistent view of the hashring. This view is not discovered dynamically: the hashring is provided to every instance as a configuration file (via the `--receive.hashrings-file` flag, or inline with `--receive.hashrings`), which is periodically reloaded when its content changes.
There are two possible hashring algorithms:

- `hashmod`: distributes series by hashing their labels modulo the number of instances. It spreads the load very evenly, but most series are reassigned whenever the hashring changes.
- `ketama`: a consistent-hashing algorithm. Load distribution is slightly less even, but only a small fraction of series move when instances are added or removed.
The hashring algorithm is configured with the `--receive.hashrings-algorithm` flag. You can use the Thanos Receive Controller to automate the management of the hashring.
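For reference, a hashring file is a small JSON document listing the members of each hashring. Mounted into the Receive pods (here via a Kubernetes ConfigMap) and referenced with `--receive.hashrings-file`, it might look like this sketch; the names, ports and empty tenant list are purely illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashrings
data:
  hashrings.json: |
    [
      {
        "hashring": "default",
        "tenants": [],
        "endpoints": [
          "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
          "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
          "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
        ]
      }
    ]
```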
Key points to consider:

- If you do not expect to scale your Receive instances up or down frequently, the `hashmod` algorithm is a good choice. It is more efficient in evenly distributing the load. Otherwise, `ketama` is recommended for its operational benefits.
- Prefer more, smaller nodes over a few big ones: the downtime caused by a hashring change with the `hashmod` algorithm will be shorter, as the amount of data to upload to the object storage is smaller. Also, when using the `ketama` algorithm, if a node falls, its requests are directly redistributed to the remaining nodes. This could cause them to be overwhelmed if there are too few of them and result in a downtime. With more nodes, the added load represents a smaller fraction of the total load.
- When clients catch up after a downtime, they can flood the Receive with their backlog. Use the `--receive.limits-config` flag to limit the amount of data that can be received. This will prevent the Receive from being overwhelmed by the catch-up.

For clients requiring high data durability, the `--receive.replication-factor` flag ensures data duplication across multiple receivers. When set to `n`, the Receive only replies with a successful processing response to the client once it has duplicated the data to `n-1` other receivers. Additionally, an external replica label can be added to each Receive instance (`--label` flag) to mark replicated data. This setup increases data resilience but also expands the data footprint.
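A hedged sketch of the corresponding flags on each Receive replica follows. The label name `replica` and the `POD_NAME` environment variable are conventions of this example, not requirements.

```yaml
args:
  - receive
  - --receive.replication-factor=3    # each series is stored on 3 receivers
  # External label marking which replica wrote the data; the querier can
  # deduplicate on it at read time.
  - --label=replica="$(POD_NAME)"
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```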
For even greater durability, replication can take into account the availability zones of the Receive instances. It will ensure that data is replicated to instances in different availability zones, reducing the risk of data loss in case of a zone failure. This is, however, only supported with the `ketama` algorithm.
Beyond the increased storage cost of replication, another downside is the increased load on the Receive instances, which must now forward a given request to multiple nodes according to the time series labels. Nodes receiving the first replica must then forward the series to the next Receive node until the replication factor is satisfied. This multiplies inter-node communication, especially with large hashrings.
A new deployment topology was proposed, separating the router and ingestor roles. The hashring configuration is read by the routers, which direct each time series to the appropriate ingestor and its replicas. This role separation provides some important benefits: routers are stateless and can be scaled independently of the stateful ingestors, and the ingestors no longer need to know about the hashring, so hashring changes do not disturb them.
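A minimal sketch of the two roles as container arguments is shown below, assuming the hashring ConfigMap from earlier; which flag combinations trigger which role is detailed just after.

```yaml
# Router: knows the hashring, owns no TSDB, only forwards requests.
args:
  - receive
  - --remote-write.address=0.0.0.0:19291
  - --receive.hashrings-file=/etc/thanos/hashrings.json
  - --receive.hashrings-algorithm=ketama
  - --receive.replication-factor=3
---
# Ingestor: no hashring configured, ingests whatever the routers forward to it.
args:
  - receive
  - --tsdb.path=/var/thanos/receive
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --label=replica="$(POD_NAME)"
```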
The Receive instance behaves in the following way:

- If no hashring is provided, it acts as an Ingestor: it ingests the samples it receives into its local TSDB.
- If a hashring is provided but `--receive.local-endpoint` is not set, it acts as a Router: it only forwards requests to the appropriate ingestors.
- If both the hashring and `--receive.local-endpoint` are set, it acts as a RouterIngestor. This last flag enables the router to identify itself in the hashring as an ingestor and ingest the data when appropriate.

To enhance reliability in data ingestion, Thanos Receive supports out-of-order samples.
Samples are ingested into the Receiver’s TSDB, which has strict requirements on the order of timestamps:

- within a given time series, each new sample must have a timestamp greater than the last ingested one;
- samples that are too old, i.e. older than the time range covered by the TSDB head (the current in-memory block), are rejected.
When these requirements are not met, the samples are dropped, and an out-of-order warning is logged. However, there are scenarios where out-of-order samples may occur, often because of clients’ misconfigurations or delayed remote write requests, which can cause samples to arrive out of order depending on the remote write implementation. Additional examples at the Prometheus level can be found in this article.
As you are not necessarily in control of your clients’ setups, you may want to increase resilience against these issues. Support for out-of-order samples has been implemented in the TSDB. This feature can be enabled with the `--tsdb.out-of-order.time-window` flag on the Receiver. The downsides are:

- Accepting out-of-order samples produces overlapping blocks, which requires that the `--compact.enable-vertical-compaction` flag is enabled on the compactor to manage these overlapping blocks. We will cover this in more detail in the next article.

Additionally, consider setting the `--tsdb.too-far-in-future.time-window` flag to a value higher than the default 0s to account for possible clock drifts between clients and the Receiver.
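Pulling these together, the relevant flags might be set as in this sketch; the window values are arbitrary examples, to be sized according to how late or how clock-skewed your clients can be.

```yaml
# Receive: tolerate late and slightly future-dated samples.
args:
  - receive
  - --tsdb.out-of-order.time-window=30m      # accept samples up to 30 minutes late
  - --tsdb.too-far-in-future.time-window=1m  # tolerate ~1 minute of client clock drift
---
# Compactor: must be allowed to merge the overlapping blocks this produces.
args:
  - compact
  - --compact.enable-vertical-compaction
```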
In this first part, we have covered the initial steps of the sample lifecycle in Thanos, focusing on the ingestion process. We have explored the remote write protocol, the Receive component, and the critical configurations needed to ensure high availability and durability. Now, our sample is safely ingested and stored in the system. In the next part, we will continue following our sample’s journey, delving into the data management and querying processes.
See the full list of articles in this series:
Sample: A sample in Prometheus represents a single data point, capturing a measurement of a specific system aspect or property at a given moment. It is the fundamental unit of data in Prometheus, reflecting real-time system states.
Labels: Every sample in Prometheus is tagged with labels, which are key-value pairs that add context and metadata. These labels typically include the metric name (the `__name__` label) and dimensions identifying what is measured, such as the job and instance of the target or metric-specific attributes like the HTTP method or status code.
External labels: External labels are appended by the scraping or receiving component (like a Prometheus server or Thanos Receive). They enable:

- identifying the origin of the data, for example the cluster, region, or replica that produced it;
- sharding the processing of blocks: persisted in the `meta.json` file of the block created by Thanos, these labels are used by the compactor and the store to shard block processing effectively.

Series or Time Series: In the context of monitoring, a series (the more generic term) is necessarily a time series. A series is defined by a unique set of label-value combinations. For instance:
```
http_requests_total{method="GET", handler="/users", status="200"}
^                   ^
|                   └── Labels (key=value format)
└── Series name (label __name__)
```
In this example, `http_requests_total` is the value of the special `__name__` label. The unique combination of labels creates a distinct series. Prometheus scrapes these series, attaching timestamps to each sample, thereby forming a dynamic time series.
For our discussion, samples can be of various types, but we will treat them as simple integers for simplicity.
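As a purely illustrative sketch, here is what a few consecutive samples of the series above could look like, with one integer value every 15 seconds; the timestamps are arbitrary Unix milliseconds.

```yaml
series: http_requests_total{method="GET", handler="/users", status="200"}
samples:                      # [timestamp (ms), value]
  - [1726480000000, 1027]
  - [1726480015000, 1031]
  - [1726480030000, 1042]
```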
The following diagram illustrates the relationship between samples, labels and series:
Thanos adopts its storage architecture from Prometheus, utilizing the TSDB (Time Series Database) file format as its foundation. Let’s review some key terminology that is needed to understand some of the configuration options.
Samples from a given time series are first aggregated into small chunks. The storage format of a chunk is highly compressed (see documentation). Accessing a given sample of the chunk requires decoding all preceding values stored in this chunk. This is why chunks hold up to 120 samples, a number chosen to strike a balance between compression benefits and the performance of reading data.
Chunks are created over time for each time series. As time progresses, these chunks are assembled into chunk files. Each chunk file, encapsulating chunks from various time series, is limited to 512MiB to manage memory usage effectively during read operations. Initially, these files cover a span of two hours and are subsequently organized into a larger entity known as a block.
A block is a directory containing the chunk files in a specific time range, an index and some metadata. The two-hour duration for initial blocks is chosen for optimizing factors like storage efficiency and read performance. Over time, these two-hour blocks undergo horizontal compaction by the compactor, merging them into larger blocks. This process is designed to optimize long-term storage by extending the time period each block covers.
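For orientation, a typical block directory looks roughly like this; the ULID-style name is illustrative.

```
01J7XYZABCDEFGHJKMNPQRSTVW/    # block directory, named after a ULID
├── chunks/
│   ├── 000001                 # chunk files (segments), up to 512MiB each
│   └── 000002
├── index                      # maps labels to series and series to chunk locations
├── meta.json                  # time range, compaction level, external labels (Thanos)
└── tombstones                 # deletion markers (local TSDB)
```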
The following diagram illustrates the relationship between chunks, chunk files and blocks: