This short document describes the motivation for, and design of, a new format that is meant to replace the index-cache.json file we currently use.
We also propose renaming index-cache to index-header, due to a name collision with the index cache for postings and series.
Currently, the Store Gateway component has to be aware of all the blocks (modulo sharding & time partitioning) in the bucket. For each block that we want to serve metrics from, the Store Gateway has to have that block synced in order to serve queries. The sync process includes:
The aforementioned index-cache.json holds the block's:
There are a few problems with this approach:
The loaded index-cache.json files take a significant amount of memory.
Problems 1, 2 & 3 contribute to the Store Gateway startup being slow and resource-consuming: https://github.com/thanos-io/thanos/issues/448
This design aims to address those four problems.
Fetching the index-header at query time directly from the bucket is not a goal of this proposal.
Reduce the time and resources spent loading index-cache.json into the Store GW.
The TSDB index is in a binary format.
To reduce the resource consumption and effort of building the "index-header" for blocks (problems (1) and (2)), we plan to reuse a similar format for sections like symbols, label indices, and posting offsets, in a separate file called index-header that will replace the currently existing index-cache.json.
The process for building this will be as follows:
With this effort, build time and resource usage should be comparable with downloading a prebuilt index-header from the bucket. This allows us to reduce the complexity of the system, as the compactor no longer needs to cache that file in object storage.
Thanks to this format, we can reuse most of the FileReader code to load the file.
Thanos will build/compose all index-headers on startup for now; in theory, however, we could load and build them on demand. Given the minimal memory that each loaded block should now take, this is described as Future Work.
While the idea of combining different pieces of the TSDB index into our index-header is great, unfortunately we rely heavily on information about the size of each posting.
We need to know a priori how to partition and how many bytes we need to fetch from the storage to get each posting: https://github.com/thanos-io/thanos/blob/7e11afe64af0c096743a3de8a594616abf52be45/pkg/store/bucket.go#L1567
To calculate those sizes, we use indexr.PostingsRanges(), which scans through the postings section of the TSDB index. Having to fetch the whole postings section just to get the size of each posting makes this proposal less valuable, as we would still need to download a big part of the index and traverse through it, instead of doing what we propose in #Proposal.
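To illustrate why computing posting sizes requires the whole postings section, here is a minimal sketch (not the actual indexr.PostingsRanges() implementation; the names `postingRange` and `rangesFromOffsets` are made up for this example): the size of each posting list can only be derived from the start offset of the *next* list, or the section end for the last one.

```go
package main

import "fmt"

// postingRange describes the byte range of one posting list
// within the postings section of a TSDB index.
type postingRange struct{ start, end int64 }

// rangesFromOffsets derives per-posting byte ranges from the sorted
// start offsets of all posting lists plus the section's end offset.
func rangesFromOffsets(starts []int64, sectionEnd int64) []postingRange {
	out := make([]postingRange, 0, len(starts))
	for i, s := range starts {
		end := sectionEnd
		if i+1 < len(starts) {
			// The posting ends where the next one begins.
			end = starts[i+1]
		}
		out = append(out, postingRange{start: s, end: end})
	}
	return out
}

func main() {
	// Hypothetical start offsets of three posting lists, section ends at 512.
	fmt.Println(rangesFromOffsets([]int64{100, 164, 300}, 512))
	// → [{100 164} {164 300} {300 512}]
}
```

Note that the last range only becomes known once the entire section has been traversed, which is exactly the cost this proposal wants to avoid.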
For series we don’t know the exact size either; however, we estimate the max size of each series to be 64*1024 bytes. This depends on a sane number of label pairs and chunks per series. We have really had only one case where this was too low: https://github.com/thanos-io/thanos/issues/552. The decision about the series size was made here: https://github.com/thanos-io/thanos/issues/146
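The series estimation above can be sketched as follows. This is an illustrative helper, not Thanos' actual code (the name `seriesFetchRange` is invented here): since the exact series size is unknown, a fixed 64KiB window starting at the series' offset is fetched.

```go
package main

import "fmt"

// maxSeriesSize mirrors the 64*1024-byte upper bound assumed for a
// single series entry (label pairs + chunk references).
const maxSeriesSize = 64 * 1024

// seriesFetchRange returns the byte range to fetch for a series whose
// exact size is unknown: a maxSeriesSize window from its offset.
func seriesFetchRange(offset int64) (start, end int64) {
	return offset, offset + maxSeriesSize
}

func main() {
	s, e := seriesFetchRange(4096)
	fmt.Println(s, e) // → 4096 69632
}
```

The tradeoff is over-fetching for small series versus failing on series larger than the bound, which is the failure mode reported in issue 552.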
For postings it’s trickier, as the size depends on the number of series in which a given label pair exists. In the worst case, a posting can reference millions of series for popular label pairs.
We have a few options:
PostingOffset: Unlikely to happen, as it is not needed by Prometheus.
index-header without downloading the full TSDB index: This option invalidates this proposal.
However, there is one option that this proposal aims for:
Users care most about surprising spikes in memory usage. Currently, the Store Gateway caches the whole index-cache.json. While it’s silly to do so for all blocks, this will happen anyway if a query spans a large number of blocks and series. This means that while baseline memory will be reduced, the difference between baseline and per-request memory will be even more noticeable.
This tradeoff is acceptable, because the total memory used across all operations should be much smaller. Additionally, queries spanning all blocks and series are unlikely and should be blocked by a simple limit.
How do we micro-benchmark such a change? mmap lives outside of the Go runtime, which counts allocations etc.
mmap adds a lot of complexity and confusion, especially around monitoring its memory usage, as it does not appear in Go profiles.
While mmap is great for random access against a big file, the current FileReader implementation in fact reallocates symbols, offsets, and label name=value pairs while reading, which defies the purpose of mmap. Since we want to combine all info into a few dense, sequential sections of the index-header binary format, this file will be read mostly sequentially. Still, label values can be accessed randomly, which is why we propose starting with mmap straight away.
After the initial startup with a persistent disk, subsequent startups should be quick due to the files cached on disk for old blocks; only new ones will be iterated on. However, the initial startup and ad-hoc syncs can still be problematic, for example during auto-scaling: to adapt to high load, you want the component to start up quickly.
Currently, all of those methods get label values and label names across all blocks in the system. This will load all blocks into the system on every such call.
We have a couple of options:
index-header from the TSDB index.
index-header file from pieces of the TSDB index.
We can maintain a pool with a limited number of index-header files loaded at a time in the Store Gateway. With LRU logic, we should be able to decide which blocks should be unloaded and left on disk.
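The pool described above could be sketched as a small LRU keyed by block ID. This is an illustrative design (the `headerPool` type and its methods are invented for this example, not Thanos' actual implementation); evicted blocks stay on disk and are re-loaded on their next access.

```go
package main

import (
	"container/list"
	"fmt"
)

// headerPool keeps at most `capacity` index-headers loaded, evicting
// the least recently used one when a new block must be loaded.
type headerPool struct {
	capacity int
	order    *list.List               // front = most recently used block ID
	loaded   map[string]*list.Element // block ID -> element in `order`
}

func newHeaderPool(capacity int) *headerPool {
	return &headerPool{
		capacity: capacity,
		order:    list.New(),
		loaded:   map[string]*list.Element{},
	}
}

// touch marks a block's index-header as used, loading it if needed.
// It returns the ID of the block evicted to disk, if any.
func (p *headerPool) touch(blockID string) (evicted string) {
	if e, ok := p.loaded[blockID]; ok {
		p.order.MoveToFront(e)
		return ""
	}
	if p.order.Len() >= p.capacity {
		back := p.order.Back()
		evicted = back.Value.(string)
		p.order.Remove(back)
		delete(p.loaded, evicted)
		// A real pool would munmap/close the evicted index-header here.
	}
	p.loaded[blockID] = p.order.PushFront(blockID)
	// A real pool would mmap the index-header from disk here.
	return evicted
}

func main() {
	p := newHeaderPool(2)
	p.touch("blockA")
	p.touch("blockB")
	p.touch("blockA")              // blockA becomes most recently used
	fmt.Println(p.touch("blockC")) // → blockB (least recently used, evicted)
}
```

A memory-based limit (as suggested below) would replace the fixed capacity with a running sum of loaded index-header sizes, evicting until the sum fits under the budget.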
This proposes the following algorithm:
Both X and Y would be configurable. From a UX perspective, it would be nice to have a configurable memory limit for loaded blocks.
index-header on demand from the bucket at query time.
index-header on startup at all; just a lazy background job if needed.