#3147 Querier: Add query.metadata.default-time-range flag to specify the default metadata time range duration for retrieving labels through Labels and Series API when the range parameters are not specified. The zero value means range covers the time since the beginning.
#3207 Query Frontend: Add cache-compression-type flag to use compression in the query frontend cache.
#2665 Swift: Fix issue with missing Content-Type HTTP headers.
#2800 Query: Fix handling of --web.external-prefix and --web.route-prefix.
#2834 Query: Fix rendered JSON state value for rules and alerts; it is now lowercase.
#2866 Receive, Querier: Fixed leaks on receive and querier Store API Series, which were leaking on errors.
#2937 Receive: Fix auto-configuration of --receive.local-endpoint.
#2895 Compact: Fix increment of thanos_compact_downsample_total metric for downsample of 5m resolution blocks.
#2858 Store: Fix --store.grpc.series-sample-limit implementation. The limit is now applied to the sum of all samples fetched across all queried blocks via a single Series call, instead of applying it individually to each block.
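The difference between the old and new limit semantics can be sketched as follows. This is an illustrative Python model rather than Thanos code: the function names and the per-block sample counts are hypothetical, and a limit of 0 is treated as "disabled".

```python
def exceeds_limit_per_block(samples_per_block, limit):
    """Old behaviour: every block is checked against the limit on its own."""
    if limit == 0:  # 0 means the limit is disabled
        return False
    return any(n > limit for n in samples_per_block)


def exceeds_limit_total(samples_per_block, limit):
    """New behaviour: the sum over all queried blocks is checked once."""
    if limit == 0:
        return False
    return sum(samples_per_block) > limit


# Hypothetical Series call touching three blocks with 400 samples each.
blocks = [400, 400, 400]
print(exceeds_limit_per_block(blocks, 1000))  # False: no single block is over 1000
print(exceeds_limit_total(blocks, 1000))      # True: 1200 samples in total
```

A limit that used to be comfortable may therefore start rejecting queries that span many blocks.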
#2936 Compact: Fix ReplicaLabelRemover panic when replicaLabels are not specified.
#2956 Store: Fix fetching of chunks bigger than 16000 bytes.
#2970 Store: Upgrade minio-go/v7 to fix slowness when running on EKS.
#2957 Rule: breaking ⚠️ Now sets all of the relevant fields properly; avoids a panic when /api/v1/rules is called and the time zone is not UTC; rules field is an empty array now if no rules have been defined in a rule group.
Thanos Rule’s /api/v1/rules endpoint no longer returns the old, deprecated partial_response_strategy. The old, deprecated value has been fixed to WARN for quite some time. Please use partialResponseStrategy.
#2976 Query: Better rounding for incoming query timestamps.
#2929 Mixin: Fix expression for ‘unhealthy sidecar’ alert and increase the timeout to 10 minutes.
#3024 Query: Consider group name and file for deduplication.
#3012 Ruler,Receiver: Fix TSDB to delete blocks atomically.
#3046 Ruler,Receiver: Fixed framing of StoreAPI responses; previously chunks were sent one by one.
#3095 Ruler: Update the manager when all rule files are removed.
#3105 Querier: Fix overwriting maxSourceResolution when auto downsampling is enabled.
#3010 Querier: Added --query.lookback-delta flag to override the default lookback delta in PromQL. The lookback delta should be set to at least 2 times the slowest scrape interval. If unset, it will use the PromQL default of 5m.
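As a rough illustration of the "at least 2 times the slowest scrape interval" guidance, the lower bound can be computed as below. The helper name and the scrape intervals are hypothetical.

```python
from datetime import timedelta


def minimum_lookback_delta(scrape_intervals):
    """Return twice the slowest (largest) scrape interval."""
    return 2 * max(scrape_intervals)


# Hypothetical scrape intervals used across a deployment.
intervals = [timedelta(seconds=15), timedelta(seconds=60), timedelta(minutes=3)]
print(minimum_lookback_delta(intervals))  # 0:06:00, i.e. 6m
```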
#2926 API: Add new blocks HTTP API to serve blocks metadata. The status endpoints (/api/v1/status/flags, /api/v1/status/runtimeinfo and /api/v1/status/buildinfo) are now available on all components with an HTTP API.
#2892 Receive: Receiver fails when the initial upload fails.
#2893 Store: Rename metric thanos_bucket_store_cached_postings_compression_time_seconds to thanos_bucket_store_cached_postings_compression_time_seconds_total.
#2915 Receive,Ruler: Enable TSDB directory locking by default. Add a new flag (--tsdb.no-lockfile) to override behavior.
#2902 Querier UI: Separate dedupe and partial response checkboxes per panel in new UI.
#2991 Store: breaking ⚠️ operation label value getrange changed to get_range for thanos_store_bucket_cache_operation_requests_total and thanos_store_bucket_cache_operation_hits_total to be consistent with bucket operation metrics.
#2876 Receive,Ruler: Updated TSDB and switched to ChunkIterators instead of sample one, which avoids unnecessary decoding / encoding.
#3064 s3: breaking ⚠️ Add SSE/SSE-KMS/SSE-C configuration. The S3 encrypt_sse: true option is now deprecated in favour of sse_config. If you used encrypt_sse, the migration strategy is to set up the following block:
#2012 Receive: Added multi-tenancy support (based on header)
#2502 StoreAPI: Added hints field to SeriesResponse. Hints is an opaque data structure that can be used to carry additional information from the store; its content is implementation specific.
#2521 Sidecar: Added thanos_sidecar_reloader_reloads_failed_total, thanos_sidecar_reloader_reloads_total, thanos_sidecar_reloader_watch_errors_total, thanos_sidecar_reloader_watch_events_total and thanos_sidecar_reloader_watches metrics.
#2412 UI: Added React UI from Prometheus upstream. Currently only accessible from Query component as only /graph endpoint is migrated.
#2532 Store: Added hidden option --store.caching-bucket.config=<yaml content> (or --store.caching-bucket.config-file=<file.yaml>) for experimental caching bucket, that can cache chunks into shared memcached. This can speed up querying and reduce number of requests to object storage.
#2579 Store: Experimental caching bucket can now cache metadata as well. Config has changed from #2532.
#2526 Compact: In case there are no labels left after deduplication via --deduplication.replica-label, assign first replica-label with value deduped.
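The fallback described above can be sketched like this. It is a simplified Python model, assuming labels are a plain dict and replica labels an ordered list; the function name is hypothetical.

```python
def strip_replica_labels(labels, replica_labels):
    """Drop replica labels; if nothing is left, keep the first one as 'deduped'."""
    remaining = {k: v for k, v in labels.items() if k not in replica_labels}
    if not remaining and replica_labels:
        remaining[replica_labels[0]] = "deduped"
    return remaining


print(strip_replica_labels({"replica": "a"}, ["replica"]))
# {'replica': 'deduped'}
print(strip_replica_labels({"cluster": "eu", "replica": "a"}, ["replica"]))
# {'cluster': 'eu'}
```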
#2621 Receive: Added flag to configure forward request timeout. Receive write will complete request as soon as quorum of writes succeeds.
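The "complete as soon as a quorum succeeds" idea can be sketched as below. This is a hedged Python model, not the Receive implementation: `write_with_quorum` and the writer callables are hypothetical, and the forward request timeout is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def write_with_quorum(replica_writers, quorum):
    """Return True as soon as `quorum` writers succeed, False otherwise.

    `replica_writers` is a non-empty list of zero-argument callables that
    raise an exception on failure.
    """
    pool = ThreadPoolExecutor(max_workers=len(replica_writers))
    futures = [pool.submit(writer) for writer in replica_writers]
    successes = 0
    try:
        for future in as_completed(futures):
            if future.exception() is None:
                successes += 1
            if successes >= quorum:
                return True  # do not wait for the remaining writers
        return False
    finally:
        pool.shutdown(wait=False)


def ok():
    return None


def fail():
    raise RuntimeError("replica unavailable")


print(write_with_quorum([ok, ok, fail], quorum=2))    # True
print(write_with_quorum([ok, fail, fail], quorum=2))  # False
```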
#2513 Tools: Moved thanos bucket commands to thanos tools bucket, and moved thanos check rules to thanos tools rules-check. thanos tools rules-check now also takes rules via the repeated --rules flag rather than as an argument.
#2548 Store, Querier: remove duplicated chunks on StoreAPI.
#2252 Query: add new --store-strict flag. More information available here.
#2265 Compact: add --wait-interval to specify compaction wait interval between consecutive compact runs when --wait is enabled.
#2250 Compact: enable vertical compaction for offline deduplication (experimental). Uses --deduplication.replica-label flag to specify the replica label on which to deduplicate (hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples like those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
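One way to picture the naive merge is the sketch below. It assumes "chaining" means merging replicas by timestamp and keeping one sample per timestamp, which is an interpretation and not taken from the Thanos source; the function name and sample data are hypothetical.

```python
import heapq


def chain_samples(*replicas):
    """Merge time-sorted (timestamp, value) lists, one sample per timestamp."""
    merged = []
    for ts, value in heapq.merge(*replicas):
        if merged and merged[-1][0] == ts:
            continue  # same timestamp already taken from another replica
        merged.append((ts, value))
    return merged


replica_a = [(1, 1.0), (2, 2.0)]
replica_b = [(1, 1.0), (2, 2.0), (3, 3.0)]
print(chain_samples(replica_a, replica_b))
# [(1, 1.0), (2, 2.0), (3, 3.0)]
```

This works cleanly only when replicas hold the same samples; replicas scraped at slightly different timestamps would simply be interleaved.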
#1714 Compact: the compact component now exposes the bucket web UI when it is run as a long-lived process.
#2304 Store: added max_item_size configuration option to memcached-based index cache. This should be set to the max item size configured in memcached (-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
#2297 Store: add --experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into the cache. Compressed postings take about 10% of the original size.
#2357 Compact and Store: the compact and store components now serve the bucket UI on :<http-port>/loaded, which shows exactly the blocks that are currently seen by compactor and the store gateway. The compactor also serves a different bucket UI on :<http-port>/global, which shows the status of object storage without any filters.
#2172 Store: add support for sharding the store component based on the label hash.
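A minimal sketch of hash-based sharding follows. The MD5-based hash is an assumption borrowed from Prometheus-style `hashmod` relabelling and is not confirmed by this changelog; `shard_for` and the label values are hypothetical.

```python
import hashlib


def shard_for(label_value, num_shards):
    """Map a label value to a shard index in [0, num_shards)."""
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % num_shards


# Each store instance would keep only the blocks whose label hashes to its index.
for value in ["cluster-a", "cluster-b", "cluster-c"]:
    print(value, "-> shard", shard_for(value, num_shards=3))
```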
#2113 Bucket: added thanos bucket replicate command to replicate blocks from one bucket to another.
#1922 Docs: create a new document to explain sharding in Thanos.
#2136 breaking Store, Compact, Bucket: schedule block deletion by adding deletion-mark.json. This adds a consistent way for multiple readers and writers to access object storage.
Since there are no consistency guarantees provided by some Object Storage providers, this PR adds a consistent lock-free way of dealing with Object Storage irrespective of the choice of object storage. In order to achieve this co-ordination, blocks are not deleted directly. Instead, blocks are marked for deletion by uploading the deletion-mark.json file for the block that was chosen to be deleted. This file contains Unix time of when the block was marked for deletion. If you want to keep existing behavior, you should add --delete-delay=0s as a flag.
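The mark-then-delete flow can be sketched as follows. This is an illustrative Python model: the JSON field names (`id`, `deletion_time`) and both helper functions are assumptions, not the documented file format.

```python
import json
import time


def make_deletion_mark(block_id, now=None):
    """Build the content of a deletion mark holding the Unix time of marking."""
    marked_at = int(time.time() if now is None else now)
    return json.dumps({"id": block_id, "deletion_time": marked_at})


def should_delete(mark_json, delete_delay_seconds, now):
    """A block is actually removed only once the delay has elapsed."""
    mark = json.loads(mark_json)
    return now - mark["deletion_time"] >= delete_delay_seconds


mark = make_deletion_mark("hypothetical-block-id", now=1_000)
print(should_delete(mark, delete_delay_seconds=3600, now=2_000))  # False
print(should_delete(mark, delete_delay_seconds=3600, now=5_000))  # True
print(should_delete(mark, delete_delay_seconds=0, now=1_000))     # True
```

With a delay of 0 the block qualifies for deletion immediately, which matches the "keep existing behavior" case.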
#2090 breaking Downsample command: the downsample command has moved and is now a sub-command of the thanos bucket sub-command; it cannot be called via thanos downsample any more.
#2294 Store: optimizations for fetching postings. Queries using =~".*" matchers or negation matchers (!=... or !~...) benefit the most.
#2301 Ruler: exit with an error when initialization fails.
#2310 Query: report timespan 0 to 0 when discovering no stores.
#2330 Store: index-header is no longer experimental. It is enabled by default for store Gateway. You can disable it with new hidden flag: --store.disable-index-header. The --experimental.enable-index-header flag was removed.
#1848 Ruler: allow returning error messages when a reload is triggered via HTTP.
#2270 All: Thanos components will now print stack traces when they error out.
#1952 Store Gateway: Implemented binary index header. This significantly reduces resource consumption (memory, CPU, net bandwidth) for startup and data loading processes as well as baseline memory. This means that adding more blocks into object storage, without querying them will use almost no resources. This, however, still means that querying large amounts of data will result in high spikes of memory and CPU use as before, due to simply fetching large amounts of metrics data. Since we fixed baseline, we are now focusing on query performance optimizations in separate initiatives. To enable experimental index-header mode run store with hidden experimental.enable-index-header flag.
#2009 Store Gateway: Minimum age of all blocks before they are being read. Set it to a safe value (e.g 30m) if your object storage is eventually consistent. GCS and S3 are (roughly) strongly consistent.
#1970 breaking Receive: Use gRPC for forwarding requests between peers. Note that existing values for the --receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple host:port combination, e.g. 127.0.0.1:10901, rather than a full HTTP URL, e.g. http://127.0.0.1:10902/api/v1/receive.
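When migrating many endpoints, the rewrite from HTTP URL to host:port can be scripted. The helper below is a hypothetical sketch; the gRPC port must be supplied explicitly because it differs from the HTTP port, and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit


def to_grpc_endpoint(http_url, grpc_port):
    """Turn an old-style HTTP receive URL into a host:port gRPC endpoint."""
    host = urlsplit(http_url).hostname
    if host is None:
        raise ValueError(f"no host found in {http_url!r}")
    return f"{host}:{grpc_port}"


print(to_grpc_endpoint("http://127.0.0.1:10902/api/v1/receive", 10901))
# 127.0.0.1:10901
```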
#1933 Add a flag --tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
#2021 Rename metric thanos_query_duplicated_store_address to thanos_query_duplicated_store_addresses_total and thanos_rule_duplicated_query_address to thanos_rule_duplicated_query_addresses_total.
#2166 Bucket Web: improve the tooltip for the bucket UI; it was reconstructed and now exposes much more information about blocks.
#1656 Store Gateway: Store now starts metric and status probe HTTP server earlier in its start-up sequence. /-/healthy endpoint now starts to respond with success earlier. /metrics endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the /-/ready endpoint rather than /metrics.
#1669 Store Gateway: Fixed store sharding. Now it does not load excluded meta.jsons and load/fetch index-cache.json files.
#1670 Sidecar: Fixed un-ordered blocks upload. Sidecar now uploads the oldest blocks first.
#1568 Store Gateway: Store now retains the first raw value of a chunk during downsampling to avoid losing some counter resets that occur on an aggregation boundary.
#1666 Compact: thanos_compact_group_compactions_total now counts block compactions, i.e. operations that resulted in a compacted block. The old behaviour is now exposed by the new metrics thanos_compact_group_compaction_runs_started_total and thanos_compact_group_compaction_runs_completed_total, which count compaction runs overall.
#1632 Removes the duplicated external labels detection on Thanos Querier; warning only; Made Store Gateway compatible with older Querier versions.
NOTE: thanos_store_nodes_grpc_connections metric is now per external_labels and store_type. It is a recommended metric for Querier storeAPIs. thanos_store_node_info is marked as obsolete and will be removed in next release.
NOTE2: Store Gateway is now advertising artificial: "@thanos_compatibility_store_type=store" label. This is to have the current Store Gateway compatible with Querier pre v0.8.0.
This label can be disabled by hidden debug.advertise-compatibility-label=false flag on Store Gateway.
#1478 Thanos components now expose gRPC server metrics as soon as the server starts, to provide more reliable data for instrumentation.
#1378 Thanos Receive now exposes thanos_receive_config_hash, thanos_receive_config_last_reload_successful and thanos_receive_config_last_reload_success_timestamp_seconds metrics to track latest configuration change
#1268 Thanos Sidecar added support for the newest Prometheus streaming remote read added here. This massively reduces the memory required by a single request for both Prometheus and the sidecar. A single request should now take a constant amount of memory on the sidecar, so resource consumption prediction is now straightforward. This will be used if you have Prometheus 2.13 or 2.12-master.
#1358 Added part_size configuration option for HTTP multipart requests minimum part size for S3 storage type
#1363 Thanos Receive now exposes thanos_receive_hashring_nodes and thanos_receive_hashring_tenants metrics to monitor status of hash-rings
#1395 Added /-/ready and /-/healthy endpoints to Thanos Sidecar.
#1297 Added /-/ready and /-/healthy endpoints to Thanos Compact.
#1431 Thanos Query added hidden flag to allow the use of downsampled resolution data for instant queries.
#1408 Thanos Store Gateway can now be restricted to the time ranges it will serve (time sharding). Flags: min-time & max-time
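Conceptually, time sharding means a store only serves blocks that overlap its configured window. The sketch below assumes a simple inclusive overlap check, which may differ from the real filter; the function name and timestamps are hypothetical.

```python
def block_in_range(block_min, block_max, min_time, max_time):
    """True if the block's time span overlaps [min_time, max_time]."""
    return block_min <= max_time and block_max >= min_time


# Hypothetical store configured to serve timestamps 1000..2000.
print(block_in_range(500, 900, 1000, 2000))    # False: entirely before the window
print(block_in_range(900, 1100, 1000, 2000))   # True: overlaps the start
print(block_in_range(2100, 2500, 1000, 2000))  # False: entirely after the window
```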
#1458 Thanos Query and Receive now use common instrumentation middleware. As a result, in favour of http_requests_total and http_request_duration_seconds_bucket, Thanos Query no longer exposes the thanos_query_api_instant_query_duration_seconds and thanos_query_api_range_query_duration_second metrics, and Thanos Receive no longer exposes thanos_http_request_duration_seconds, thanos_http_requests_total and thanos_http_response_size_bytes.
#1253 Add support for specifying a maximum amount of retries when using Azure Blob storage (default: no retries).
#1244 Thanos Compact now exposes new metrics thanos_compact_downsample_total and thanos_compact_downsample_failures_total which are useful to catch when errors happen
#1260 Thanos Query/Rule now exposes metrics thanos_querier_store_apis_dns_provider_results and thanos_ruler_query_apis_dns_provider_results which tell how many addresses were configured and how many were actually discovered respectively
#1248 Add a web UI to show the state of remote storage.
#1217 Thanos Receive gained basic hashring support
#1262 Thanos Receive got a new metric thanos_http_requests_total which shows how many requests were handled by it
#1243 Thanos Receive gained the ability to forward time series data between nodes. Now you can pass the hashring configuration via --receive.hashrings-file; the refresh interval via --receive.hashrings-file-refresh-interval; the local node’s name via --receive.local-endpoint; and finally the name of the header used to determine the tenant via --receive.tenant-header.
#1147 Support for the Jaeger tracer has been added!
breaking New common flags were added for configuring tracing: --tracing.config-file and --tracing.config. You can either pass a file to Thanos with the tracing configuration or pass it in the command line itself. Old --gcloudtrace.* flags were removed ⚠️
To migrate over the old --gcloudtrace.* configuration, your tracing configuration should look like this:
#1284 Add support for multiple label-sets in Info gRPC service.
This deprecates the single Labels slice of the InfoResponse; in a future release, backward-compatible handling for the single set of Labels will be removed. Upgrading to v0.6.0 or higher is advised.
breaking If you have duplicate queries in your Querier configuration with hierarchical federation of multiple Queries, this PR makes Thanos Querier detect this case and block all duplicates. Refer to 0.6.1, which at least allows a single replica to work.
#1314 Removes http_request_duration_microseconds (Summary) and adds http_request_duration_seconds (Histogram) from http server instrumentation used in Thanos APIs and UIs.
#1287 Sidecar now waits on Prometheus’ external labels before starting the uploading process
#1261 Thanos Receive now properly exposes the metrics thanos_http_request_duration_seconds and thanos_http_response_size_bytes for each handler
#1274 Iteration limit has been lifted from the LRU cache so there should be no more spam of error messages as they were harmless
#1321 Thanos Query now fails early on a query which only uses external labels - this improves clarity in certain situations
#1227 Some context handling issues were fixed in Thanos Compact; some unnecessary memory allocations were removed in the hot path of Thanos Store.
#1183 Compactor now correctly propagates retriable/haltable errors which means that it will not unnecessarily restart if such an error occurs
#1231 Receive now correctly handles SIGINT and closes without deadlocking
#1278 Fixed inflated values problem with sum() on Thanos Query
#1280 Fixed a problem with concurrent writes to a map in Thanos Query while rendering the UI
#1311 Fixed occasional panics in Compact and Store when using Azure Blob cloud storage caused by lack of error checking in client library.
#1322 Removed duplicated closing of the gRPC listener - this gets rid of harmless messages like store gRPC listener: close tcp 0.0.0.0:10901: use of closed network connection when those programs are being closed
#1118 breaking swift: Added support for cross-domain authentication by introducing userDomainID, userDomainName, projectDomainID, projectDomainName.
The outdated terms tenantID, tenantName are deprecated and have been replaced by projectID, projectName.
⚠️ IMPORTANT ⚠️ This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.
This release also disables gossip mode by default for all components.
See this for more details.
⚠️ This release moves Thanos docker images (NOT artifacts by accident) to Golang 1.12. This release includes a change in the GC’s memory release, which has the following effect:
On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.
If you want to see the exact memory allocation of the Thanos process:
Use the go_memstats_heap_alloc_bytes metric exposed by Golang or container_memory_working_set_bytes exposed by cadvisor.
Add GODEBUG=madvdontneed=1 before running the Thanos binary to revert memory releasing to the pre-1.12 logic.
#910 Query’s stores UI page is now sorted by type and old DNS or File SD stores are removed after 5 minutes (configurable via the new --store.unhealthy-timeout=5m flag).
#905 Thanos support for Query API: /api/v1/labels. Notice that the API was added in Prometheus v2.6.
#798 Ability to limit the maximum number of concurrent requests to Series() calls in Thanos Store and the maximum amount of samples we handle.
#1060 Allow specifying region attribute in S3 storage configuration
⚠️ WARNING ⚠️ #798 adds a new default limit to Thanos Store: --store.grpc.series-max-concurrency. Most likely you will want to make it the same as --query.max-concurrent on Thanos Query.
New Store flags:
* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider making it lower or higher depending on the scale of your deployment.
New Store metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.
New Store tracing span:
* store_query_gate_ismyturn shows how long it took for a query to pass (or not) through the gate.
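The gate behind these flags and metrics can be pictured as a bounded semaphore. The Python class below is a hypothetical model, not the Thanos implementation: `in_flight` loosely mirrors thanos_bucket_store_queries_in_flight and `last_wait_seconds` the time a call spent waiting at the gate.

```python
import threading
import time


class Gate:
    """Allow at most `max_concurrency` callers inside the gate at once."""

    def __init__(self, max_concurrency):
        self._sem = threading.BoundedSemaphore(max_concurrency)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.last_wait_seconds = 0.0

    def __enter__(self):
        start = time.monotonic()
        self._sem.acquire()  # blocks while the gate is full
        self.last_wait_seconds = time.monotonic() - start
        with self._lock:
            self.in_flight += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.in_flight -= 1
        self._sem.release()
        return False  # never swallow exceptions from the guarded call


gate = Gate(max_concurrency=20)
with gate:
    during = gate.in_flight  # one Series() call is being executed
after = gate.in_flight
print(during, after)  # 1 0
```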
New Querier and Ruler flag: --store.sd-dns-resolver, which allows specifying the resolver to use: either golang or miekgdns.
#986 Save some startup and sync time in the Store Gateway, as it no longer needs to compute the index cache from the block index on its own for larger blocks.
The Store Gateway can still do it, but it first checks whether an index cache has already been uploaded to the bucket.
At the same time, the compactor precomputes the index cache file on every compaction.
New Compactor flag: --index.generate-missing-cache-file was added to allow quicker addition of index cache files. If enabled, it precomputes missing files on compactor startup. Note that it will take time and it is only a one-off step per bucket.
#887 Compact: Added new --block-sync-concurrency flag, which allows you to configure number of goroutines to use when syncing block metadata from object storage.
#928 Query: Added --store.response-timeout flag. If a Store doesn’t send any data in this specified duration then a Store will be ignored and partial data will be returned if it’s enabled. 0 disables timeout.
#893 S3 storage backend has graduated to stable maturity level.
#936 Azure storage backend has graduated to stable maturity level.
#937 S3: added trace functionality. You can add trace.enable: true to enable the minio client’s verbose logging.
#953 Compact: now has a hidden flag --debug.accept-malformed-index. Compaction index verification will ignore out of order label names.
#963 GCS: added possibility to inline ServiceAccount into GCS config.
#1010 Compact: added new flag --compact.concurrency. Number of goroutines to use when compacting groups.
#1028 Query: added --query.default-evaluation-interval, which sets default evaluation interval for sub queries.
#980 Ability to override Azure storage endpoint for other regions (China)
#970 Deprecated partial_response_disabled proto field. Added partial_response_strategy instead. Both in gRPC and Query API.
No PartialResponseStrategy field for RuleGroups by default means abort strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.
Added thanos_rule_evaluation_with_warnings_total to Ruler.
DNS thanos_ruler_query_apis* are now thanos_ruler_query_apis_* for consistency.
DNS thanos_querier_store_apis* are now thanos_querier_store_apis_* for consistency.
Query Gate thanos_bucket_store_series* are now thanos_bucket_store_series_* for consistency.
Most of the Thanos Ruler metrics related to the rule manager have a strategy label.
Ruler tracing spans:
/rule_instant_query HTTP[client] is now /rule_instant_query_part_resp_abort HTTP[client] if the request is for the abort strategy.
#1009: Upgraded Prometheus (~v2.7.0-rc.0 to v2.8.1) and TSDB (v0.4.0 to v0.6.1) deps.
Changes that affect Thanos:
[ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
[ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
[BUGFIX] Fix panic when aggregator param is not a literal. #5290
[ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
[BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
[BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our issue #1027
[BUGFIX] Fix sorting of rule groups. #5260
store: [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
tooling: [FEATURE] New dump command to tsdb tool to dump all samples.
[ENHANCEMENT] When closing the db any running compaction will be cancelled so it doesn’t block.
[CHANGE] breaking Renamed flag --sync-delay to --consistency-delay #1053
Note that this was added on TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370
However, due to the nature of Thanos compaction (distributed systems), for safety reasons this is disabled for the Thanos compactor for now.
⚠️ WARNING ⚠️ #873 fixes the actual handling of index-cache-size. Handling of the limit for this cache was broken, so it was unbounded all the time. From this release the actual value matters and is extremely low by default. To “revert” to the old behaviour (no boundary), use a large enough value.