Block overlap: a set of blocks with exactly the same external labels in meta.json that cover the same or overlapping time periods.
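To make the definition concrete, here is a minimal sketch of an overlap check over already-loaded meta.json contents. It assumes the usual meta.json fields (`ulid`, `minTime`, `maxTime`, `thanos.labels`) and treats each block's time range as half-open; the `find_overlaps` helper and the label values are illustrative and are not part of Thanos.

```python
from collections import defaultdict
from itertools import combinations


def find_overlaps(metas):
    """Return pairs of ULIDs whose blocks share external labels and overlap in time."""
    groups = defaultdict(list)
    for meta in metas:
        # Blocks only count as overlapping when their external labels match exactly.
        key = tuple(sorted(meta["thanos"]["labels"].items()))
        groups[key].append(meta)

    overlaps = []
    for blocks in groups.values():
        for a, b in combinations(blocks, 2):
            # Assumption: [minTime, maxTime) is half-open, so ranges that only
            # touch at a boundary are not reported.
            if a["minTime"] < b["maxTime"] and b["minTime"] < a["maxTime"]:
                overlaps.append((a["ulid"], b["ulid"]))
    return overlaps


# Illustrative input: the labels below are made up for this sketch.
metas = [
    {"ulid": "01D94ZRM050JQK6NDYNVBNR6WQ",
     "minTime": 1555128000000, "maxTime": 1555135200000,
     "thanos": {"labels": {"cluster": "eu1", "replica": "0"}}},
    {"ulid": "01D8AQXTF2X914S419TYTD4P5B",
     "minTime": 1555128000000, "maxTime": 1555135200000,
     "thanos": {"labels": {"cluster": "eu1", "replica": "0"}}},
]
print(find_overlaps(metas))
```

Two blocks with different external labels, or with back-to-back time ranges, are not reported by this check.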
Thanos is designed to never end up with overlapped blocks. This means that (uncontrolled) block overlap should never happen in a healthy and well-configured Thanos system. That’s why there is no automatic repair for it. Since it’s an unexpected incident:

* All reader components, like Store Gateway, will handle this gracefully (overlapped samples will be deduplicated).
* Thanos compactor will stop all activities and HALT or crash (exposing a metric and logging an error). This is because it cannot perform compactions and downsampling. In the overlap situation, we know something unexpected happened (e.g. manual block upload, some malformed data, etc.), so it’s safer to stop or crash loop (it’s configurable).
Let’s take an example:
msg="critical error detected; halting" err="compaction failed: compaction: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1555128000000, maxt: 1555135200000, range: 2h0m0s, blocks: 2]: <ulid: 01D94ZRM050JQK6NDYNVBNR6WQ, mint: 1555128000000, maxt: 1555135200000, range: 2h0m0s>, <ulid: 01D8AQXTF2X914S419TYTD4P5B, mint: 1555128000000, maxt: 1555135200000, range: 2h0m0s>
In this halted example, we can read that the compactor detected 2 overlapped blocks. What’s interesting is that the two blocks look “similar”: they cover exactly the same period of time. This might mean that potential reasons are:
Checking the producers’ logs for these ULIDs and inspecting meta.json (e.g. whether the sample stats are the same or not) helps. Checksum the index and chunks files as well to reveal whether the data is exactly the same and thus OK to be removed manually. You may find the scripts/thanos-block.jq script useful when inspecting meta.json files, as it translates timestamps to human-readable form.
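The comparison described above can be sketched in a few lines of Python. This is only an illustration, assuming both blocks have been downloaded to local directories with the usual layout (`meta.json`, `index`, `chunks/`); the helper names `ms_to_utc`, `block_fingerprint`, and `same_data` are hypothetical and are not part of Thanos.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def ms_to_utc(ms):
    """Convert a millisecond Unix timestamp (as stored in meta.json) to ISO 8601 UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large chunk files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def block_fingerprint(block_dir):
    """Summarise one block directory: time range, sample stats, per-file checksums."""
    block_dir = Path(block_dir)
    meta = json.loads((block_dir / "meta.json").read_text())
    files = [block_dir / "index"] + sorted((block_dir / "chunks").glob("*"))
    return {
        "range": (ms_to_utc(meta["minTime"]), ms_to_utc(meta["maxTime"])),
        "stats": meta.get("stats"),
        "checksums": {f.relative_to(block_dir).as_posix(): sha256_file(f) for f in files},
    }


def same_data(dir_a, dir_b):
    """True when stats, index, and chunk files of both blocks are identical."""
    a, b = block_fingerprint(dir_a), block_fingerprint(dir_b)
    return a["stats"] == b["stats"] and a["checksums"] == b["checksums"]


# Example call, assuming both blocks were downloaded into the current directory:
# same_data("01D94ZRM050JQK6NDYNVBNR6WQ", "01D8AQXTF2X914S419TYTD4P5B")
```

A `True` result only says that the compared files match byte for byte; the decision to remove a block manually still needs the producer logs as context.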
level=warn ts=2020-04-18T03:07:00.512902927Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="request flags against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp 127.0.0.1:9090: connect: connection refused"
The connection refused error states that there is no server running at localhost:9090, which is the address of Prometheus in this case.
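A quick way to confirm this diagnosis is to test whether anything accepts TCP connections on that address. The sketch below uses only the Python standard library; the `is_listening` helper is hypothetical, and the host and port are taken from the log line above.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Prometheus reachable:", is_listening("localhost", 9090))
```

If this reports `False`, start Prometheus (or correct the address the sidecar points at) before starting Thanos.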
level=info ts=2020-04-18T03:16:32.158536285Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="no external labels configured on Prometheus server, uniquely identifying external labels must be configured"
Thanos requires unique external_labels for further processing. So make sure that the external_labels are not empty and globally unique in the Prometheus config file. A possible example:
    global:
      external_labels:
        cluster: eu1
        replica: 0
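The two requirements, non-empty and globally unique, can be checked across several Prometheus instances with a small script. This is a sketch under assumptions: the configs are already parsed into dictionaries (for example with a YAML library), and the instance names `prom-a` to `prom-d` and the `check_external_labels` helper are made up for illustration.

```python
def check_external_labels(configs):
    """Return a list of problems found in each instance's external_labels.

    `configs` maps an instance name to its already-parsed Prometheus config.
    """
    problems = []
    seen = {}
    for name, cfg in configs.items():
        labels = (cfg.get("global") or {}).get("external_labels") or {}
        if not labels:
            problems.append(f"{name}: external_labels is empty")
            continue
        # Compare label sets as strings, since YAML may parse `replica: 0` as an int.
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        if key in seen:
            problems.append(f"{name}: same external_labels as {seen[key]}")
        else:
            seen[key] = name
    return problems


# Hypothetical instances used only for this example.
configs = {
    "prom-a": {"global": {"external_labels": {"cluster": "eu1", "replica": 0}}},
    "prom-b": {"global": {"external_labels": {"cluster": "eu1", "replica": 1}}},
    "prom-c": {"global": {}},
    "prom-d": {"global": {"external_labels": {"cluster": "eu1", "replica": "0"}}},
}
for problem in check_external_labels(configs):
    print(problem)
```

Here `prom-c` is flagged for having no labels and `prom-d` for repeating the label set of `prom-a`.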