Our goals here are:
Thanos performs similar operations as Prometheus do on TSDB blocks with the following differences:
Moving from lock-based logic to coordination free and from strongly consistent local filesystem to potentially eventually consistent remote simplified “filesystem” in form of Object Storage API, causes additional cases that we need to consider in Thanos system, like:
Currently, we have only basic safeguards against those in form of syncDelay
. That helps with operations between sidecar and compactor but does not fix inter-compactor logic in case of eventual consistent storage or partial upload failures.
We also partially assumed strong consistency for object storage, because GCS has strong consistency and S3 has it as well, but with some caveats that we met.
However with increasing number of other providers (ceph, hdfs, COS, Azure, Swift) with no (or unknown) strong consistency guarantee this assumption needs to go away.
We have an explicit API for Series fetching in form of gRPC StoreAPI, but we lack the verbose API for other side: the object storage format with assumptions on behavior that will cover partial blocks, eventual consistency and syncing requirements, which is a goal for this proposal.
We propose defining those set of rules to solve A, B and C goals that will define the object storage format API (and how to operate with those objects):
syncDelay
) from creation time. Block id (ULID
) is used to determine creation time (if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond)
). Block fresher than 15 minutes is assumed still in upload (Fresh
). This can be overridden by ignore_delay
field in meta.json.meta.json
file within block have to be uploaded as the last one in the block folder.syncDelay
), if there is no fully loadable meta.json
, we assume block is malformed, partially uploaded or marked for deletion and can be removed by compactor. [Compactor change needed]HealthyOverlapped
state). [Compactor/Repair change needed]meta.json
file. All components will exclude this block and compactor will do eventual deletion assuming the block is partially uploaded. [Compactor change needed]deleteDelay
) before deleting the whole To Delete
block. [Compactor change needed]Below diagram shows the block states defined by our rules.
To deal with A and B, so overall eventual consistency, we will leverage existing feature of metric format (TSDB) which is block immutability. In object storage format, a block is a directory named with ULID
that contains chunks
dir with one or more chunk files (each not bigger than 512MB), index
and meta.json
files. To ensure every component has the same state of metrics we create first strict rule:
1 . A block, once uploaded is never edited (no file within ULID directory is removed, edited or added).
With this rule alone we restrict bucket operations only to:
As those operations are not atomic itself, partial blocks or inconsistent block view may occur for short time (read-after-write consistency). To mitigate this we introduce second rule:
2 . A block is assumed ready to be read only after 15 minutes (configured by
syncDelay
) from creation time. Block id (ULID
) is used to determine creation time (if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond)
). Block fresher than 15 minutes is assumed still in upload (Fresh
). This can be overridden byignore_delay
field in meta.json.
Readers may attempt to load block’s files (e.g meta.json to grab external labels), but should be aware that those files can be partially uploaded.
The delay is not only for block eventual consistency. Even if everyone can read block properly, we are not aware if all readers loaded the block properly.
To ensure we have block fully uploaded we define:
3 . TSDB
meta.json
file within block have to be uploaded as the last one in the block folder.
Currently and as proposed meta.json
is our integrity marker (it is uploaded as the last one). It matters even when we don’t have strong consistency, because it helps us detect the case of upload
crashed in between of upload. We only upload meta.json after successful uploads of rest of the files for the block.
With those rule one could say “why to wait 15 minutes if we can just check if meta.json is parsable”. In eventual consistency world we cannot just check meta.json as even when it was uploaded at the end, other files might be still unavailable for a reader.
Thanks of meta.json being uploaded at the end we can form 4th rule:
4 . After 15 minutes (or time configured by
syncDelay
), if there is no fully loadablemeta.json
, we assume block is malformed, partially uploaded or marked for deletion and can be removed by compactor. [Compactor change needed]
This is to detect partial uploads caused by for example compactor being rolled out or crashed. Other writers managed by shipper
package relies on fact that blocks are persisted on disk so the upload operation can be retried.
Similar to writers, readers are excluding all blocks fresher than 15m or older but without meta.json
. Overall those 4 rules solved issue C) and A), B) in terms of block upload.
However, sometimes we need to change blocks for reasons, like:
delete series
operations.Since all blocks are immutable, such changes are performed by rewriting block (or compacting multiple) into new block. This makes removal
operations a first citizen object lifecycle. The main difficult point during this process is to make sure all readers are synced and aware that new block is ready in place of the old one(s), so writer can remove old block(s). In eventual consistency system we don’t have that information without additional coordination. To mitigate we propose 2 new rules. All to support lazy deletion:
5 . Overlaps (in external labels and time dimensions) with the same data are allowed and should be handled by readers gracefully. [Compactor change needed]
This should be the case currently for all components except compactor. For example store having overlapping blocks with even the same data or sidecar and store exposing same data via StoreAPI is handled gracefully by frontend.
The major difference here is compactor. Currently compactor does not allow overlap. It immediately halts compactor and waits for manual actions. This is on purpose to not allow block malformation by blocks which should be not compacted together.
“Overlaps with the same data” is crucial here as valid overlaps are when:
Newer block from two or more overlapping blocks fully submatches the source blocks of older blocks. Older blocks can be then ignored.
The word fully is crucial.
We will determine overlaps in data by using Full Overlap Block Detection Algorithm which is defined as following: Consider the below diagram which shows different tsdb blocks:
In the first example, we see that the block F is a compacted block containing data of blocks ABCD. The blocks ABCD are source blocks that contain data of ABCD blocks. Since block F contains data for all blocks that blocks ABCD contain, the block F completely overlaps with blocks ABCD, we can safely delete ABCD.
In the second example, the block F is a compacted block containing data of blocks ABCD. The blocks ABCD are blocks contains data of ABCDE blocks. Since block F doesn’t contain data related to block E, the block F cannot be safely deleted.
Having this kind of overlap support, we can delay deletion by forming 6th rule:
6 . Adding “replacement” or “overlapped” block causes old blocks to be scheduled for removal when new block is ready by compactor (
HealthyOverlapped
state). [Compactor/Repair change needed]
This rule ensures that we can detect when to delete block, and we delete it only once new block is ready, so when all readers already loaded new block (e.g to syncDelay).
This also means that repairing block gracefully (rewriting for deleting series or other reason that does not need to be unblocked), is as easy as uploading new block. Compactor will delete older overlapped block eventually.
There is caveat for rule 2nd rule (block being ready) for compactor. In case of compaction process we compact and we want to be aware of this block later on. Because of eventual consistent storage, we cannot, so we potentially have uploaded a block that is “hidden” for first 15 minutes. This is bad, as compactor will see the source blocks (see 6th rule why we don’t delete those immediately) and will compact same blocks for next 15 min (until we can spot the uploaded block). To mitigate this compactor:
In the worst case it is possible that compactor will compact twice same source blocks. This will be handled gracefully because of rule 5th and 6th (valid overlaps are ok and older submatching will be deleted anyway later on)
To match partial upload safeguards we want to delete block in reverse order:
7 . To schedule delete operation, delete
meta.json
file. All components will exclude this block and compactor will do eventual deletion assuming the block is partially uploaded. [Compactor change needed]
We schedule deletions instead of doing them straight away for 3 reasons:
Along with this, we also store information about when the block was scheduled to be deleted so that it can be deleted at a later point in time. To do so, we create a file deletion-mark.json
where we store information about when the block was scheduled to be deleted. Storing the information in a file makes it resilient to failures that result in restarts.
There might be exception for malformed blocks that blocks compaction or reader operations. Since we may need to unblock the system immediately the block can be forcibly removed meaning that query failures may occur (reader loaded block, but not aware block was deleted).
8 . Compactor waits minimum 15m (
deleteDelay
) before deleting the wholeTo Delete
block. [Compactor change needed]
This is to make sure we don’t forcibly remove block which is still loaded on reader side.
We check the deletion-mark.json
file to identify if the block has to be deleted. After 15 minutes of marking the block to be deleted, we are ok to delete the whole block directory.
What if someone will set too low syncDelay
? And upload itself will take more than syncDelay
time. Block will be assumed as ToDelete
state and will be removed. Other use case is when Prometheus is partitioned/misconfigured from object storage for longer time (hours?). Once up it will upload all blocks with ULID older than time.Now-sync-delay. How to ensure that will not happen?
deleteDelay
it might have still some time to recover, we might rely on that.upload timeout
has to smaller than syncDelay
.What if one block is malformed. Readers cannot handle it and crashes. How repair procedure will work? We can have repair process that can download block locally, rewrite it and fix it and upload. Problem is that it will take syncDelay
time to appear in system. Since we are blocked, we need to make the block available immediately, ignoring eventual consistency limitations. Potential solutions:
ignore_delay
option and avoid Fresh state as possible. Eventual consistency issue may hit us.Do we allow repair procedure when compactor is running? This is quite unsafe as compaction/downsamlping operation might be in progress so repair procedure might be avoided. Turning off compactor might be tricky as well in potential HA mode. Maybe we want to introduce “locking” some external labels from compactor operations. If yes, how to do it in eventual consistency system?
syncDelay
is implemented on all components (handling fresh blocks) with reasonable minimum value for syncDelay (e.g 5 minutes?)Healthy/HealthyOverlapped
-> toDelete
state change.HealthyOverlapped
block based on source ULIDs from meta file.ScheduleDelete
that will delete only meta.jsondeleteDelay
support for compactor on apply on toDelete
blocks.shipper
to avoid unnecessary duplicated compactions/downsamplingsignore_delay
parameter that will ignore syncDelay for sudden repairs.HealthyOverlapped
to reduce resource consumption for store gateway.As mentioned in #Motivation section this would block community adoption as only a few of object storages has strong or even clearly stated consistency guarantees.
By not removing any object, we may simplify problem. However we need hard deletions, because they are:
We may want to add some file to commit and mark integrity of the block. That file may hold some information like files we expect to have in the directory and maybe some hashes of those.
This was rejected for we can reuse existing meta.json as “integrity” marker (e.g if after 15m we don’t have meta file, it is partially uploaded and should be removed).
Additional information like expected number of files can be added later if needed, however with rule no 3. we should not need that.
The main drawback of this is changing or required additional things from TSDB block format which increases complexity.
As this would help with mentioned issues, Thanos aims for being coordination free and having no additional system dependency to stay operationally simple.
Currently, compactor requires the blocks to appear in order of creation. That might be not the case for eventual consistent storage. In those cases compactor assumes gap (no block being produced) and compacts with the mentioned gap. Once the new block is actually available for reading it causes compactor to halt due to overlap. Vertical compaction might help here as can handle those overlaps gracefully, by compacting those together. This however:
As vertical compaction might be something we can support in future, it clearly does not help with problems we stated here.
deleteDelay
We may avoid introducing additional state, by adding mitigation for not having delayed removal.
For example for retention apply or manual block deletion, when we would delete block immediately we can have query failures (object does not exists for getrange operations)
We can avoid it by:
retention
flag to readers (compactor have it, but store gateway does not). Cons: