All notable changes to this project will be documented in this file.
NOTE: As semantic versioning states, all 0.y.z releases can contain breaking changes in API (flags, gRPC API, any backward compatibility).
We use the word breaking to mark changes that are not backward compatible (this relates only to v0.y.z releases).
--store.unhealthy-timeout was never respected
--store-strict flag. More information available here.
--wait-interval to specify compaction wait interval between consecutive compact runs when
--deduplication.replica-label flag to specify the replica label to deduplicate on (Hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples, such as those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
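To illustrate what "just chaining samples together" can mean in practice, here is a minimal sketch, not the Compactor's actual code. It assumes each replica is a time-sorted list of `(timestamp, value)` pairs for the same series; the function name `naive_merge` is hypothetical.

```python
import heapq


def naive_merge(*replicas):
    """Chain samples from several replicas of one series into a single
    time-ordered list, keeping only the first sample seen per timestamp.

    Assumption: every replica is already sorted by timestamp.
    """
    merged = []
    for ts, value in heapq.merge(*replicas):
        if merged and merged[-1][0] == ts:
            continue  # a sample with this exact timestamp was already kept
        merged.append((ts, value))
    return merged


# Replicas with precisely the same samples collapse cleanly.
replica_a = [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
replica_b = [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
identical = naive_merge(replica_a, replica_b)

# Replicas scraped at slightly different times do not: both samples survive.
shifted = naive_merge([(1000, 1.0)], [(1005, 1.0)])
```

The second call shows why the approach suits replicated Receiver data (identical timestamps) better than independently scraped replicas, whose timestamps rarely line up.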
max_item_size config option to memcached-based index cache. This should be set to the max item size configured in memcached (
-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
--experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into cache. Compressed postings take about 10% of the original size.
deletion-mark.json file for the block that was chosen to be deleted. This file contains the Unix time of when the block was marked for deletion. If you want to keep the existing behavior, you should add
--delete-delay=0s as a flag.
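The interaction between the deletion mark and the delete delay can be sketched as follows. This is an illustration only: the JSON field name `deletion_time` and the helper `ready_for_deletion` are assumptions made for the example, not a statement of the file's exact schema.

```python
import json


def ready_for_deletion(mark_json: str, delete_delay_seconds: float, now: float) -> bool:
    """Return True once the block has been marked for at least the delay.

    Assumption: the mark file stores the Unix time of marking under a
    'deletion_time' key (hypothetical name for this sketch).
    """
    mark = json.loads(mark_json)
    return now - mark["deletion_time"] >= delete_delay_seconds


mark = json.dumps({"deletion_time": 1_000})

# With a delay of 0s the block is deletable immediately (the old behavior).
immediate = ready_for_deletion(mark, 0, now=1_000)

# With a 48h delay, one hour after marking the block is still kept.
delayed = ready_for_deletion(mark, 48 * 3600, now=1_000 + 3600)
```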
downsample command has moved as the
thanos bucket sub-command, and cannot be called via
thanos downsample any more.
=~".*" matchers or negation matchers (
!~...) benefit the most.
--experimental.enable-index-header flag was removed.
index-header mode run store with hidden
--query.config-file CLI flags. See documentation for further information.
thanos_proxy_store_empty_stream_responses_total metric for number of empty responses from stores.
thanos bucket replicate.
--receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple
127.0.0.1:10901, rather than a full HTTP URL, e.g.
--tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
#1937 Compactor: Improved synchronization of meta JSON files. Compactor now properly handles partial block uploads for all operations such as retention apply, downsampling and compaction. Additionally:
#1936 Store: Improved synchronization of meta JSON files. Store now properly handles corrupted disk cache. Added meta.json sync metrics.
#1856 Receive: close DBReadOnly after flushing to fix a memory leak.
#1882 Receive: upload to object storage as ‘receive’ rather than ‘sidecar’.
#1907 Store: Fixed the duration unit for the metric
#1931 Compact: Fixed the compactor exiting successfully when an error had actually occurred while compacting a block group.
/api/v1/rules now shows a properly formatted value
master container images are now built with Go 1.13
#1956 Ruler: now properly ignores duplicated query addresses
#1975 Store Gateway: fixed panic caused by memcached servers selector when there’s 1 memcached node
AWS_CONTAINER_CREDENTIALS_FULL_URI by upgrading to minio-go v6.0.44
--alertmanagers.config-file CLI flags. See documentation for further information.
--alertmanagers.sd-dns-interval CLI option to specify the interval between DNS resolutions of Alertmanager hosts.
POST - useful for sending bigger requests
#1947 Upgraded Prometheus dependencies to v2.15.2. This includes:
#1867 Ruler: now sets a
User-Agent in requests
#1887 Service discovery now deduplicates targets between different target groups
--grpc-grace-period CLI option to components which serve gRPC to set how long to wait until gRPC Server shuts down.
--prometheus.ready_timeout CLI option to the sidecar to set how long to wait until Prometheus starts up.
AliYun OSS object storage, see documents for further information.
--http-grace-period CLI option to components which serve HTTP to set how long to wait until HTTP Server shuts down.
--http-address to match other components.
thanos_compactor_iterations_total on Thanos Compactor which shows the number of successful iterations.
thanos bucket web now supports
--web.external-prefix for proxying on a subpath.
--web.prefix-header flags to allow for bucket UI to be accessible behind a reverse proxy.
/-/healthy endpoint now starts to respond with success earlier.
/metrics endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the
/-/ready endpoint rather than
failed to assert type of rule ... message.
--web.external-prefix 404s for static resources.
thanos_compact_group_compactions_total now counts block compactions, that is, operations that resulted in a compacted block. The old behaviour is now exposed by a new metric:
thanos_compact_group_compaction_runs_completed_total which counts compaction runs overall.
prober_healthy metrics are removed, for sake of
status exposes same metric with a label,
check can have “healthy” or “ready” depending on status of the probe.
thanos_store_nodes_grpc_connections metric is now per
store_type. It is a recommended metric for Querier storeAPIs.
thanos_store_node_info is marked as obsolete and will be removed in the next release.
"@thanos_compatibility_store_type=store" label. This is to have the current Store Gateway compatible with Querier pre v0.8.0. This label can be disabled by hidden
debug.advertise-compatibility-label=false flag on Store Gateway.
Lots of improvements this release! Noteworthy items:
- First Katacoda tutorial! 🐱
- Fixed deletion order causing Compactor to produce unneeded 👻 blocks with missing random files.
- Store GW memory improvements (more to come!).
- Querier allows multiple deduplication labels.
- Both Compactor and Store Gateway can be sharded within the same bucket using relabelling!
- Data exposed by Sidecar from Prometheus can now be limited to a given
min-time (e.g. 3h only).
- Numerous Thanos Receive improvements.
Make sure you check out Prometheus 2.13.0 as well. New release drastically improves usage and resource consumption of both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming/
selector.relabel-config) into Thanos Store and Compact components. Selecting blocks to serve depends on the result of block labels relabeling.
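One way to picture relabelling-based sharding is a hash-modulo selection over a block label: each instance keeps only the blocks whose label hashes to its own shard number. The sketch below illustrates that idea under stated assumptions. The hash used here is chosen for the example and is not claimed to match the relabelling implementation bit for bit, and the names `shard_of`, `blocks_for_shard` and the `cluster` label are hypothetical.

```python
import hashlib


def shard_of(label_value: str, shards: int) -> int:
    """Map a label value to a shard number in [0, shards).

    Illustrative hash only: MD5 digest, last 8 bytes, modulo shard count.
    """
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % shards


def blocks_for_shard(blocks, label, shard, shards):
    """Select the blocks one Store or Compact instance would serve."""
    return [b for b in blocks if shard_of(b["labels"].get(label, ""), shards) == shard]


# Hypothetical block metadata: an ID plus the block's external labels.
blocks = [
    {"id": "block-1", "labels": {"cluster": "eu1"}},
    {"id": "block-2", "labels": {"cluster": "us1"}},
    {"id": "block-3", "labels": {"cluster": "ap1"}},
    {"id": "block-4", "labels": {"cluster": "eu1"}},
]

per_shard = [blocks_for_shard(blocks, "cluster", s, 3) for s in range(3)]
```

Because the selection depends only on the label value, every block lands in exactly one shard, and blocks sharing a label value always land together.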
/query_range querier endpoints. When provided overwrite the
query.replica-label configuration can be provided more than once for multiple deduplication labels like:
label to be consistent with other commands.
Accepted into CNCF:
- Thanos moved to new repository https://github.com/thanos-io/thanos
- Docker images moved to https://quay.io/thanos/thanos and mirrored at https://hub.docker.com/r/thanosio/thanos
- Slack moved to https://slack.cncf.io
thanos_receive_config_last_reload_success_timestamp_seconds metrics to track latest configuration change
part_size configuration option for HTTP multipart requests minimum part size for S3 storage type
thanos_receive_hashring_tenants metrics to monitor status of hash-rings
/-/healthy endpoints to Thanos sidecar.
/-/healthy endpoints to Thanos compact.
/series API end-point now properly returns an empty array just like Prometheus if there are no results
http_request_duration_seconds_bucket; Thanos Query no longer exposes
thanos_query_api_range_query_duration_second metrics and Thanos Receive no longer exposes
thanos check rules linter for Thanos rule rules files.
#1253 Add support for specifying a maximum amount of retries when using Azure Blob storage (default: no retries).
#1244 Thanos Compact now exposes new metrics
thanos_compact_downsample_failures_total which are useful to catch when errors happen
#1260 Thanos Query/Rule now exposes metrics
thanos_ruler_query_apis_dns_provider_results which tell how many addresses were configured and how many were actually discovered respectively
#1248 Add a web UI to show the state of remote storage.
#1217 Thanos Receive gained basic hashring support
#1262 Thanos Receive got a new metric
thanos_http_requests_total which shows how many requests were handled by it
#1243 Thanos Receive got an ability to forward time series data between nodes. Now you can pass the hashring configuration via
--receive.hashrings-file; the refresh interval
--receive.hashrings-file-refresh-interval; the name of the local node
--receive.local-endpoint; and finally the header’s name which is used to determine the tenant
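The forwarding decision can be sketched roughly as: pick the hashring that serves the tenant, then hash the tenant plus the series labels onto one of that hashring's endpoints. The snippet below is a simplified illustration, not the Receive implementation. The configuration shape (`hashring`, `tenants`, `endpoints` keys), the hash function and the name `pick_endpoint` are assumptions for the example.

```python
import hashlib

# Hypothetical hashring configuration: the first ring is dedicated to
# tenant "a"; the ring without a tenant list accepts every other tenant.
HASHRINGS = [
    {"hashring": "tenant-a", "tenants": ["a"],
     "endpoints": ["10.0.0.1:10901", "10.0.0.2:10901"]},
    {"hashring": "default",
     "endpoints": ["10.0.0.3:10901"]},
]


def pick_endpoint(hashrings, tenant, series_labels):
    """Return the endpoint that should store this tenant's series."""
    for ring in hashrings:
        tenants = ring.get("tenants", [])
        if tenants and tenant not in tenants:
            continue  # this ring is reserved for other tenants
        # Build a stable key so every node computes the same target.
        key = tenant + "".join(f"{k}={v};" for k, v in sorted(series_labels.items()))
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(ring["endpoints"])
        return ring["endpoints"][index]
    raise LookupError(f"no hashring accepts tenant {tenant!r}")


target = pick_endpoint(HASHRINGS, "a", {"__name__": "up", "job": "node"})
fallback = pick_endpoint(HASHRINGS, "b", {"__name__": "up", "job": "node"})
```

A node compares the chosen endpoint with its own local endpoint: if they match it stores the series, otherwise it forwards the request.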
#1147 Support for the Jaeger tracer has been added!
breaking New common flags were added for configuring tracing:
--tracing.config. You can either pass a file to Thanos with the tracing configuration or pass it in the command line itself. Old
--gcloudtrace.* flags were removed ⚠️
To migrate over the old
--gcloudtrace.* configuration, your tracing configuration should look like this:
    ---
    type: STACKDRIVER
    config:
    - service_name: 'foo'
      project_id: '123'
      sample_factor: 123
type you can use is
JAEGER now. The
config keys and values are Jaeger specific and you can find all of the information here.
#1284 Add support for multiple label-sets in Info gRPC service.
This deprecates the single
Labels slice of the
InfoResponse; in a future release, backward-compatible handling for the single set of Labels will be removed. Upgrading to v0.6.0 or higher is advised.
breaking If you have duplicate queries in your Querier configuration with hierarchical federation of multiple Queries, this PR makes Thanos Querier detect this case and block all duplicates. Refer to 0.6.1, which at least allows a single replica to work.
http_request_duration_microseconds (Summary) and adds
http_request_duration_seconds (Histogram) from http server instrumentation used in Thanos APIs and UIs.
#1287 Sidecar now waits on Prometheus’ external labels before starting the uploading process
#1261 Thanos Receive now exposes metrics
thanos_http_response_size_bytes properly for each handler
#1274 Iteration limit has been lifted from the LRU cache so there should be no more spam of error messages as they were harmless
#1321 Thanos Query now fails early on a query which only uses external labels - this improves clarity in certain situations
#1227 Some context handling issues were fixed in Thanos Compact; some unnecessary memory allocations were removed in the hot path of Thanos Store.
#1183 Compactor now correctly propagates retriable/haltable errors which means that it will not unnecessarily restart if such an error occurs
#1231 Receive now correctly handles SIGINT and closes without deadlocking
#1278 Fixed inflated values problem with
sum() on Thanos Query
#1280 Fixed a problem with concurrent writes to a
map in Thanos Query while rendering the UI
#1311 Fixed occasional panics in Compact and Store when using Azure Blob cloud storage caused by lack of error checking in client library.
#1322 Removed duplicated closing of the gRPC listener - this gets rid of harmless messages like
store gRPC listener: close tcp 0.0.0.0:10901: use of closed network connection when those programs are being closed
TL;DR: Store LRU cache is no longer leaking, Upgraded Thanos UI to Prometheus 2.9, Fixed auto-downsampling, Moved to Go 1.12.5 and more.
This version moved tarballs to Golang 1.12.5 from 1.11 as well, so same warning applies if you use
container_memory_usage_bytes from cadvisor. Use
breaking As announced a couple of times, this release also removes gossip with all configuration flags (
#1118 breaking swift: Added support for cross-domain authentication by introducing
The outdated terms
tenantName are deprecated and have been replaced by
#1066 Upgrade Thanos ui to Prometheus v2.9.1.
Changes from the upstream:
* query:
  - [ENHANCEMENT] Update moment.js and moment-timezone.js PR #4679
  - [ENHANCEMENT] Support to query elements by a specific time PR #4764
  - [ENHANCEMENT] Update to Bootstrap 4.1.3 PR #5192
  - [BUGFIX] Limit number of metrics in prometheus UI PR #5139
  - [BUGFIX] Web interface Quality of Life improvements PR #5201
* rule:
  - [ENHANCEMENT] Improve rule views by wrapping lines PR #4702
  - [ENHANCEMENT] Show rule evaluation errors on rules page PR #4457
#1190 Updated minio deps (S3 bucket client). This fixes minio retries.
#1133 Use prometheus v2.9.2, common v0.4.0 & tsdb v0.8.0.
Changes from the upstreams:
* store gateway:
  - [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
* store gateway & compactor:
  - [BUGFIX] Fix fd and vm_area leak on error path in chunks.NewDirReader.
  - [BUGFIX] Fix fd and vm_area leak on error path in index.NewFileReader.
* query:
  - [BUGFIX] Make sure subquery range is taken into account for selection #5467
  - [ENHANCEMENT] Check for cancellation on every step of a range evaluation. #5131
  - [BUGFIX] Exponentiation operator to drop metric name in result of operation. #5329
  - [BUGFIX] Fix output sample values for scalar-to-vector comparison operations. #5454
* rule:
  - [BUGFIX] Reload rules: copy state on both name and labels. #5368
--cluster.* flags removed and Thanos will error out if any is provided.
⚠️ IMPORTANT ⚠️ This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.
This release also disables gossip mode by default for all components. See this for more details.
⚠️ This release moves Thanos docker images (NOT artifacts by accident) to Golang 1.12. This release includes a change in the GC’s memory release which has the following effect (source: https://golang.org/doc/go1.12):
On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.
If you want to see exact memory allocation of Thanos process:
go_memstats_heap_alloc_bytes metric exposed by Golang or
container_memory_working_set_bytes exposed by cadvisor.
GODEBUG=madvdontneed=1 before running Thanos binary to revert memory releasing to pre-1.12 logic.
container_memory_usage_bytes metric could be misleading e.g: https://github.com/google/cadvisor/issues/2242
⚠️ WARNING ⚠️ #798 adds a new default limit to Thanos Store:
--store.grpc.series-max-concurrency. Most likely you will want to make it the same as
--query.max-concurrent on Thanos Query.
New Store flags:
* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider making it lower or higher depending on the scale of your deployment.
New Store metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.
New Store tracing span:
store_query_gate_ismyturn shows how long it took for a query to pass (or not) through the gate.
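The two limits described above can be pictured as a gate in front of every Series() call: a semaphore caps concurrency, and a sample count check drops oversized queries. The sketch below is a simplified model under stated assumptions, not the Store's implementation. The class `SeriesGate`, its attributes and the exception type are hypothetical.

```python
import threading


class SampleLimitExceeded(Exception):
    """Raised when a query would return more samples than allowed."""


class SeriesGate:
    def __init__(self, max_concurrency=20, sample_limit=0):
        # At most max_concurrency callers run at once; the rest wait here.
        self._slots = threading.BoundedSemaphore(max_concurrency)
        self._lock = threading.Lock()
        self.sample_limit = sample_limit  # 0 means "no limit"
        self.dropped = 0                  # analogous to a "queries dropped" counter

    def run(self, fetch_samples):
        """Execute one Series()-like call through the gate."""
        with self._slots:
            samples = fetch_samples()
            if self.sample_limit and len(samples) > self.sample_limit:
                with self._lock:
                    self.dropped += 1
                raise SampleLimitExceeded(
                    f"{len(samples)} samples exceed limit {self.sample_limit}")
            return samples


gate = SeriesGate(max_concurrency=2, sample_limit=3)
small = gate.run(lambda: [1.0, 2.0])

try:
    gate.run(lambda: [1.0, 2.0, 3.0, 4.0])
except SampleLimitExceeded:
    pass  # the oversized query is dropped and counted
```

In this model, time spent waiting on the semaphore corresponds to the gate duration, which is why matching the concurrency limit to the Querier's own concurrency is a sensible starting point.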
New Querier and Ruler flag:
--store.sd-dns-resolver, which allows specifying the resolver to use. Either
New Compactor flag:
--index.generate-missing-cache-file was added to allow quicker addition of index cache files. If enabled, it precomputes missing files on compactor startup. Note that it will take time and it’s only a one-off step per bucket.
--block-sync-concurrency flag, which allows you to configure number of goroutines to use when syncing block metadata from object storage.
--store.response-timeout flag. If a Store doesn’t send any data in this specified duration then a Store will be ignored and partial data will be returned if it’s enabled. 0 disables timeout.
trace.enable: true to enable the minio client’s verbose logging.
--debug.accept-malformed-index. Compaction index verification will ignore out of order label names.
--compact.concurrency. Number of goroutines to use when compacting groups.
--query.default-evaluation-interval, which sets default evaluation interval for sub queries.
series now supports POST method.
query_range now supports POST method.
partial_response_disabled proto field. Added
partial_response_strategy instead. Both in gRPC and Query API. No
RuleGroups by default means
abort strategy (old PartialResponse disabled) as this is the recommended option for Rules and alerts.
* Added `thanos_rule_evaluation_with_warnings_total` to Ruler.
* DNS `thanos_ruler_query_apis*` are now `thanos_ruler_query_apis_*` for consistency.
* DNS `thanos_querier_store_apis*` are now `thanos_querier_store_apis__*` for consistency.
* Query Gate `thanos_bucket_store_series*` are now `thanos_bucket_store_series_*` for consistency.
* Most of the Thanos ruler metrics related to the rule manager have a `strategy` label.
Ruler tracing spans:
* `/rule_instant_query HTTP[client]` is now `/rule_instant_query_part_resp_abort HTTP[client]` if request is for abort strategy.
Changes that affect Thanos:
* [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
* [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
* [BUGFIX] Fix panic when aggregator param is not a literal. #5290
* [ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
* [BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
* [BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our issue #1027
* [BUGFIX] Fix sorting of rule groups. #5260
* store: [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
* tooling: [FEATURE] New dump command to tsdb tool to dump all samples.
* [ENHANCEMENT] When closing the db any running compaction will be cancelled so it doesn’t block.
* [CHANGE] breaking Renamed flag
For ruler, essentially the whole TSDB CHANGELOG between v0.4.0 and v0.6.1 applies: https://github.com/prometheus/tsdb/blob/master/CHANGELOG.md
Note that this was added on TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370 However, due to the nature of Thanos compaction (distributed systems), for safety reasons this is disabled for Thanos compactor for now.
thanos_objstore_bucket_last_successful_upload_time now does not appear when no blocks have been uploaded so far.
⚠️ WARNING ⚠️ #873 fixes actual handling of
index-cache-size. Handling of limit for this cache was
broken so it was unbounded all the time. From this release actual value matters and is extremely low by default. To “revert”
the old behaviour (no boundary), use a large enough value.
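To show what a correctly enforced size limit means for such a cache, here is a minimal sketch of a byte-bounded LRU. It is an illustration only, not the Store's cache code: the class `SizeBoundedLRU` and its methods are hypothetical, and item size is taken simply as the length of the stored bytes.

```python
from collections import OrderedDict


class SizeBoundedLRU:
    """LRU cache whose total stored bytes never exceed max_bytes."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value: bytes) -> bool:
        if len(value) > self.max_bytes:
            return False  # can never fit, so do not evict anything for it
        if key in self._items:
            self.used -= len(self._items.pop(key))
        # Evict least recently used entries until the new value fits.
        while self.used + len(value) > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
        self._items[key] = value
        self.used += len(value)
        return True


cache = SizeBoundedLRU(max_bytes=10)
cache.put("a", b"aaaa")
cache.put("b", b"bbbb")
cache.get("a")           # "a" becomes the most recently used entry
cache.put("c", b"cccc")  # does not fit alongside both, so "b" is evicted
```

With a very low limit almost every insert evicts something, which is why the default needs raising for any sizeable deployment.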
slice bounds out of range.
bucket inspect command for better insights on blocks in object storage.
--web.prefix-header. Details here
count_values PromQL function. #4585
20 default, so no change by default)
put_user_metadata option to config.
insecure_skip_verify option to config.
Next Thanos release, adding support for a new discovery method, gRPC mTLS and two new object store providers (Swift and Azure).
Note the many necessary breaking changes in flags that relate to bucket configuration.
thanos_objstore_gcs_bucket_operations_total in favor of generic bucket operation metrics.
thanos_ prefix to memberlist (gossip) metrics. Make sure to update your dashboards and rules.
"X-Amz-Acl": "bucket-owner-full-control" metadata for s3 upload operation.
--objstore.config-file to reference the bucket configuration file in YAML format. Detailed information can be found in document storage.
thanos rule, static configuration of query nodes via
thanos rule, file based discovery of query nodes using
thanos query, file based discovery of store nodes using
/-/healthy endpoint to Querier.
dnssrv+ prefixes for the respective lookup. Details here
--cluster.disable flag to disable gossip functionality completely.
thanos_compactor_retries_total metric not being registered.
Initial version to have a stable reference before gossip protocol removal.