The compactor is one of the most important components in Thanos. It is responsible for doing compaction, downsampling, and retention of the data in the object storage.
When your system contains a lot of block producers (Sidecar, Rule, Receiver, etc) or the scale is large, the compactor might not be able to keep up with the data producing rate and it falls behind, which causes a lot of backlogged work. This document will help you to troubleshoot the backlog compaction issue and how to scale the compactor.
running
#Before checking whether your compactor has backlog issues, please make sure compactors are running
. Running
here means compactors don’t halt.
If compactors halt, any compaction or downsample process stops so it is crucial to make sure no halt happens for compactor deployment.
thanos_compact_halted
metric will be set to 1 when halt happens. You can also find logs like below, telling that compactor is halting.
msg="critical error detected; halting" err="compaction failed...
There could be different reasons that caused the compactor to halt. A very common case is overlapping blocks. Please refer to our doc https://thanos.io/tip/operating/troubleshooting.md/#overlaps for more information.
Self-monitoring for the monitoring system is important. We highly recommend you set up the Thanos Grafana dashboards and alerts to monitor the Thanos components. Without self-monitoring, it is hard to detect the issue and fix the problems.
If you find these issues in your own Thanos deployment, then your compactor might be backlogged and needs to scale more:
thanos_compact_downsample_total
metric doesn’t increase.thanos_compact_iterations_total
metric doesn’t increase or the rate is very low.In the current implementation, the compactor will perform compaction, downsampling, and retention phases in order, which means that if the compaction work is not finished, the downsampling and retention phase won’t start. So that’s why you will find symptom 2 and 3 mentioned above.
For symptom 4, thanos_compact_iterations_total
metric doesn’t increase means that the Thanos compactor is currently working on the current compaction iteration and cannot finish it in a long time. It is very similar to the case of a message queue. The producers are the components that upload blocks to your object storage. And the compactor is the consumer to deal with the jobs from producers. If the data production rate is higher than the processing rate of the consumer, then the compactor will fall behind.
Since the Thanos v0.24 release, four new metrics thanos_compact_todo_compactions
, thanos_compact_todo_compaction_blocks
, thanos_compact_todo_downsample_blocks
and thanos_compact_todo_deletion_blocks
are added to show the compaction, downsampling and retention progress and backlog.
thanos_compact_todo_compactions
: The number of planned compactions. thanos_compact_todo_compaction_blocks
: The number of blocks that are planned to be compacted. thanos_compact_todo_downsample_blocks
: The number of blocks that are queued to be downsampled. thanos_compact_todo_deletion_blocks
: The number of blocks that are queued for retention.
To use these metrics, for example you can use sum(thanos_compact_todo_compactions)
to get the overall compaction backlog or use sum(thanos_compact_todo_compactions) by (group)
to get which compaction group is the slowest one.
This feature works by syncing block metadata from the object storage every 5 minutes by default and then simulating the compaction planning process to calculate how much work needs to be done. You can change the default 5m interval by setting compact.progress-interval
flag or disable it by setting compact.progress-interval=0
.
To prevent the compactor from falling behind, you can scale the compactor.
To scale the compactor vertically, you can give it more CPU and memory resources. It also has two flags compact.concurrency
and downsample.concurrency
to control the number of workers to do compaction and downsampling.
To scale the compactor horizontally, you can run multiple compactor instances with different min-time
and max-time
flags so that each compactor will only work on the blocks that are within the time range.
Another way of horizontal scaling is to run multiple compactors by sharding. You can use the relabel config to shard the compactors by external labels like cluster
so that blocks with the same external label will be processed by the same compactor.
There are some bucket tools that can help you to troubleshoot and solve this issue.
You can use bucket ls
, bucket inspect
and bucket web
UI to view your current blocks.
If a lot of blocks are older than the retention period and the compactor is not performing retention, then to clean them up, you can do the following:
thanos tools bucket retention
command to perform retention on your blocks directly. This step won’t delete your blocks right away. It just adds deletion markers to the blocks for retention.thanos tools bucket cleanup --delete-delay=0s
to delete those marked blocks right away.Found a typo, inconsistency or missing information in our docs? Help us to improve Thanos documentation by proposing a fix on GitHub here ❤️