All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
NOTE: As Semantic Versioning states, all 0.y.z releases can contain breaking changes in the API (flags, gRPC API, any backward compatibility).
We use the word breaking to mark changes that are not backward compatible (this relates only to v0.y.z releases).
`AWS_CONTAINER_CREDENTIALS_FULL_URI` by upgrading to minio-go v6.0.44
`--grpc-grace-period` CLI option to components which serve gRPC to set how long to wait until gRPC Server shuts down.
`--prometheus.ready_timeout` CLI option to the sidecar to set how long to wait until Prometheus starts up.
AliYun OSS object storage, see documents for further information.
`--http-grace-period` CLI option to components which serve HTTP to set how long to wait until HTTP Server shuts down.
`--listen` to `--http-address` to match other components.
`thanos_compactor_iterations_total` on Thanos Compactor which shows the number of successful iterations.
`thanos bucket web` now supports `--web.external-prefix` for proxying on a subpath.
`--web.prefix-header` flags to allow for bucket UI to be accessible behind a reverse proxy.
`/-/healthy` endpoint now starts to respond with success earlier. `/metrics` endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the `/-/ready` endpoint rather than `/metrics`.
`failed to assert type of rule ...` message.
`--web.external-prefix` 404s for static resources.
`offset`.
`thanos_compact_group_compactions_total` now counts block compactions, that is, operations that resulted in a compacted block. The old behaviour is now exposed by new metrics: `thanos_compact_group_compaction_runs_started_total` and `thanos_compact_group_compaction_runs_completed_total`, which count compaction runs overall.
`prober_ready` and `prober_healthy` metrics are removed in favour of `status`. Now `status` exposes the same metric with a label, `check`. `check` can be “healthy” or “ready” depending on the status of the probe.
`thanos_store_nodes_grpc_connections` metric is now per `external_labels` and `store_type`. It is a recommended metric for Querier storeAPIs. `thanos_store_node_info` is marked as obsolete and will be removed in the next release.
`"@thanos_compatibility_store_type=store"` label. This is to have the current Store Gateway compatible with Querier pre v0.8.0. This label can be disabled by the hidden `debug.advertise-compatibility-label=false` flag on Store Gateway.
Lots of improvements this release! Noteworthy items:
- First Katacoda tutorial! 🐱
- Fixed Deletion order causing Compactor to produce not needed 👻 blocks with missing random files.
- Store GW memory improvements (more to come!).
- Querier allows multiple deduplication labels.
- Both Compactor and Store Gateway can be sharded within the same bucket using relabelling!
- Data exposed by the Sidecar from Prometheus can now be limited to a given `min-time` (e.g. 3h only).
- Numerous Thanos Receive improvements.
Make sure you check out Prometheus 2.13.0 as well. New release drastically improves usage and resource consumption of both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming/
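To make the "multiple deduplication labels" item in the list above more concrete, here is a minimal sketch of the underlying idea: series that differ only in the configured replica labels are treated as replicas of one series. The function names and the sample label sets are hypothetical illustrations, not Thanos code, and the sketch only groups replicas; it does not merge their samples.

```python
from collections import defaultdict


def dedup_key(labels, replica_labels):
    """Return the label set with all replica labels dropped, as a hashable key."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in replica_labels))


def group_replicas(series, replica_labels):
    """Group series that differ only in their replica labels."""
    groups = defaultdict(list)
    for labels in series:
        groups[dedup_key(labels, replica_labels)].append(labels)
    return dict(groups)


# Hypothetical label sets: two replicas of one series plus an unrelated series.
series = [
    {"__name__": "up", "job": "api", "prometheus_replica": "a", "service": "x"},
    {"__name__": "up", "job": "api", "prometheus_replica": "b", "service": "y"},
    {"__name__": "up", "job": "db", "prometheus_replica": "a", "service": "x"},
]
groups = group_replicas(series, {"prometheus_replica", "service"})
```

With both `prometheus_replica` and `service` treated as replica labels, the first two series fall into the same group, which is the situation multiple deduplication labels are meant to cover.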
`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components. Selecting blocks to serve depends on the result of block labels relabeling.
`/-/ready` and `/-/healthy` endpoints.
`/-/ready` and `/-/healthy` endpoints.
`/-/ready` and `/-/healthy` endpoints.
`/-/ready` and `/-/healthy` endpoints.
`/-/ready` and `/-/healthy` endpoints.
`replicaLabels` param for `/query` and `/query_range` querier endpoints. When provided, it overrides the `query.replica-label` CLI flags.
`resendDelay` flag.
`query.replica-label` configuration can be provided more than once for multiple deduplication labels like: `--query.replica-label=prometheus_replica --query.replica-label=service`.
`labels` to `label` to be consistent with other commands.
`+` in it.
Accepted into CNCF:
- Thanos moved to new repository https://github.com/thanos-io/thanos
- Docker images moved to https://quay.io/thanos/thanos and mirrored at https://hub.docker.com/r/thanosio/thanos
- Slack moved to https://slack.cncf.io `#thanos`/`#thanos-dev`/`#thanos-prs`
`thanos_receive_config_hash`, `thanos_receive_config_last_reload_successful` and `thanos_receive_config_last_reload_success_timestamp_seconds` metrics to track latest configuration change
`2.13` or `2.12-master`.
`part_size` configuration option for HTTP multipart requests minimum part size for S3 storage type
`thanos_receive_hashring_nodes` and `thanos_receive_hashring_tenants` metrics to monitor status of hash-rings
`/-/ready` and `/-/healthy` endpoints to Thanos sidecar.
`/-/ready` and `/-/healthy` endpoints to Thanos compact.
`min-time` & `max-time`
`downsampling.disable`.
`/series` API end-point now properly returns an empty array just like Prometheus if there are no results
`http_requests_total` and `http_request_duration_seconds_bucket`; Thanos Query no longer exposes `thanos_query_api_instant_query_duration_seconds`, `thanos_query_api_range_query_duration_second` metrics and Thanos Receive no longer exposes `thanos_http_request_duration_seconds`, `thanos_http_requests_total`, `thanos_http_response_size_bytes`.
#1097 Added `thanos check rules` linter for Thanos rule rules files.
#1253 Add support for specifying a maximum amount of retries when using Azure Blob storage (default: no retries).
#1244 Thanos Compact now exposes new metrics `thanos_compact_downsample_total` and `thanos_compact_downsample_failures_total` which are useful to catch when errors happen
#1260 Thanos Query/Rule now exposes metrics `thanos_querier_store_apis_dns_provider_results` and `thanos_ruler_query_apis_dns_provider_results` which tell how many addresses were configured and how many were actually discovered respectively
#1248 Add a web UI to show the state of remote storage.
#1217 Thanos Receive gained basic hashring support
#1262 Thanos Receive got a new metric `thanos_http_requests_total` which shows how many requests were handled by it
#1243 Thanos Receive got an ability to forward time series data between nodes. Now you can pass the hashring configuration via `--receive.hashrings-file`; the refresh interval `--receive.hashrings-file-refresh-interval`; the local node’s name `--receive.local-endpoint`; and finally the header’s name which is used to determine the tenant `--receive.tenant-header`.
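As a rough illustration of what forwarding via a hashring means, the sketch below picks a node deterministically from a hash of the tenant and the series labels, so every node agrees on where a given series belongs. This is a simplified assumption for illustration only: the function name, the node addresses and the modulo-based selection are hypothetical and are not taken from Thanos Receive.

```python
import hashlib


def pick_node(nodes, tenant, series_labels):
    """Pick one node deterministically from a hash of tenant and sorted labels."""
    if not nodes:
        raise ValueError("hashring has no nodes")
    payload = tenant + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(series_labels.items())
    )
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer index into the node list.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical node addresses and tenant name.
nodes = ["receive-0:10901", "receive-1:10901", "receive-2:10901"]
labels = {"__name__": "up", "job": "api"}
first = pick_node(nodes, "tenant-a", labels)
second = pick_node(nodes, "tenant-a", labels)
```

Because the choice depends only on the tenant and the labels, repeated calls return the same node, which is the property that lets any node forward a series to its owner.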
#1147 Support for the Jaeger tracer has been added!
breaking New common flags were added for configuring tracing: `--tracing.config-file` and `--tracing.config`. You can either pass a file to Thanos with the tracing configuration or pass it in the command line itself. Old `--gcloudtrace.*` flags were removed ⚠️
To migrate over the old `--gcloudtrace.*` configuration, your tracing configuration should look like this:
---
type: STACKDRIVER
config:
- service_name: 'foo'
  project_id: '123'
  sample_factor: 123
The other `type` you can use is `JAEGER` now. The `config` keys and values are Jaeger specific and you can find all of the information here.
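When migrating several deployments, it can help to generate the tracing configuration instead of writing it by hand. The following is a minimal sketch that renders text shaped like the STACKDRIVER example above; the helper name is hypothetical, and the sketch assumes the values contain no quote characters.

```python
def stackdriver_tracing_config(service_name, project_id, sample_factor):
    """Render a tracing configuration shaped like the STACKDRIVER example."""
    # Assumption: service_name and project_id contain no single quotes.
    return "\n".join([
        "---",
        "type: STACKDRIVER",
        "config:",
        f"- service_name: '{service_name}'",
        f"  project_id: '{project_id}'",
        f"  sample_factor: {int(sample_factor)}",
    ]) + "\n"


config_text = stackdriver_tracing_config("foo", "123", 123)
# The text can be written to a file and referenced with --tracing.config-file,
# or passed inline with --tracing.config.
args = ["--tracing.config=" + config_text]
```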
#1284 Add support for multiple label-sets in Info gRPC service.
This deprecates the single `Labels` slice of the `InfoResponse`. In a future release, backward compatible handling for the single set of Labels will be removed. Upgrading to v0.6.0 or higher is advised.
breaking If you have duplicate queries in your Querier configuration with hierarchical federation of multiple Queries, this PR makes Thanos Querier detect this case and block all duplicates. Refer to 0.6.1 which at least allows for single replica to work.
#1314 Removes `http_request_duration_microseconds` (Summary) and adds `http_request_duration_seconds` (Histogram) from http server instrumentation used in Thanos APIs and UIs.
#1287 Sidecar now waits on Prometheus’ external labels before starting the uploading process
#1261 Thanos Receive now exposes metrics `thanos_http_request_duration_seconds` and `thanos_http_response_size_bytes` properly for each handler
#1274 Iteration limit has been lifted from the LRU cache so there should be no more spam of error messages as they were harmless
#1321 Thanos Query now fails early on a query which only uses external labels - this improves clarity in certain situations
#1227 Some context handling issues were fixed in Thanos Compact; some unnecessary memory allocations were removed in the hot path of Thanos Store.
#1183 Compactor now correctly propagates retriable/haltable errors which means that it will not unnecessarily restart if such an error occurs
#1231 Receive now correctly handles SIGINT and closes without deadlocking
#1278 Fixed inflated values problem with `sum()` on Thanos Query
#1280 Fixed a problem with concurrent writes to a `map` in Thanos Query while rendering the UI
#1311 Fixed occasional panics in Compact and Store when using Azure Blob cloud storage caused by lack of error checking in client library.
#1322 Removed duplicated closing of the gRPC listener - this gets rid of harmless messages like `store gRPC listener: close tcp 0.0.0.0:10901: use of closed network connection` when those programs are being closed
TL;DR: Store LRU cache is no longer leaking, Upgraded Thanos UI to Prometheus 2.9, Fixed auto-downsampling, Moved to Go 1.12.5 and more.
This version moved tarballs to Golang 1.12.5 from 1.11 as well, so the same warning applies if you use `container_memory_usage_bytes` from cadvisor. Use `container_memory_working_set_bytes` instead.
breaking As announced a couple of times, this release also removes gossip with all configuration flags (`--cluster.*`).
#1118 breaking swift: Added support for cross-domain authentication by introducing `userDomainID`, `userDomainName`, `projectDomainID`, `projectDomainName`.
The outdated terms `tenantID`, `tenantName` are deprecated and have been replaced by `projectID`, `projectName`.
#1066 Upgrade Thanos UI to Prometheus v2.9.1.
Changes from the upstream:
* query:
  * [ENHANCEMENT] Update moment.js and moment-timezone.js PR #4679
  * [ENHANCEMENT] Support to query elements by a specific time PR #4764
  * [ENHANCEMENT] Update to Bootstrap 4.1.3 PR #5192
  * [BUGFIX] Limit number of metrics in prometheus UI PR #5139
  * [BUGFIX] Web interface Quality of Life improvements PR #5201
* rule:
  * [ENHANCEMENT] Improve rule views by wrapping lines PR #4702
  * [ENHANCEMENT] Show rule evaluation errors on rules page PR #4457
#1190 Updated minio deps (S3 bucket client). This fixes minio retries.
#1133 Use prometheus v2.9.2, common v0.4.0 & tsdb v0.8.0.
Changes from the upstreams:
* store gateway:
  * [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
* store gateway & compactor:
  * [BUGFIX] Fix fd and vm_area leak on error path in chunks.NewDirReader.
  * [BUGFIX] Fix fd and vm_area leak on error path in index.NewFileReader.
* query:
  * [BUGFIX] Make sure subquery range is taken into account for selection #5467
  * [ENHANCEMENT] Check for cancellation on every step of a range evaluation. #5131
  * [BUGFIX] Exponentiation operator to drop metric name in result of operation. #5329
  * [BUGFIX] Fix output sample values for scalar-to-vector comparison operations. #5454
* rule:
  * [BUGFIX] Reload rules: copy state on both name and labels. #5368
`--cluster.*` flags removed and Thanos will error out if any is provided.
⚠️ IMPORTANT ⚠️ This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.
This release also disables gossip mode by default for all components. See this for more details.
⚠️ This release moves Thanos docker images (NOT artifacts by accident) to Golang 1.12. This release includes change in GC’s memory release which gives following effect (source: https://golang.org/doc/go1.12):
On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.
If you want to see the exact memory allocation of the Thanos process:
* Use the `go_memstats_heap_alloc_bytes` metric exposed by Golang or `container_memory_working_set_bytes` exposed by cadvisor.
* Add `GODEBUG=madvdontneed=1` before running the Thanos binary to revert memory releasing to the pre-1.12 logic.
Using the cadvisor `container_memory_usage_bytes` metric could be misleading, e.g.: https://github.com/google/cadvisor/issues/2242
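If Thanos is started from a wrapper script rather than a shell, the `GODEBUG` setting has to be placed in the child process environment. A minimal sketch, assuming a Python launcher; the helper name and the sample environment are hypothetical. `GODEBUG` is a comma-separated list of `name=value` pairs, so existing settings are preserved.

```python
import os


def thanos_env(base_env=None):
    """Copy an environment and set GODEBUG so the Go runtime uses MADV_DONTNEED."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("GODEBUG", "")
    # Keep other GODEBUG settings, replacing any previous madvdontneed value.
    settings = [
        s for s in existing.split(",") if s and not s.startswith("madvdontneed=")
    ]
    settings.append("madvdontneed=1")
    env["GODEBUG"] = ",".join(settings)
    return env


# Hypothetical starting environment with another GODEBUG setting already present.
env = thanos_env({"PATH": "/usr/bin", "GODEBUG": "gctrace=1"})
# env can then be passed to subprocess.Popen([...], env=env) when starting the binary.
```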
`--store.unhealthy-timeout=5m` flag).
⚠️ WARNING ⚠️ #798 adds a new default limit to Thanos Store: `--store.grpc.series-max-concurrency`. Most likely you will want to make it the same as `--query.max-concurrent` on Thanos Query.
New options:
New Store flags:
* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider making it lower or bigger depending on the scale of your deployment.
New Store metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.
New Store tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.
New Querier and Ruler flag: `--store.sd-dns-resolver`, which allows specifying the resolver to use. Either `golang` or `miekgdns`.
New Compactor flag: `--index.generate-missing-cache-file` was added to allow quicker addition of index cache files. If enabled it precomputes missing files on compactor startup. Note that it will take time and it’s only a one-off step per bucket.
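The two new Store limits described above can be pictured as a gate in front of each Series() call plus a check on the number of samples. The sketch below illustrates that concept with a semaphore; it is not the Thanos implementation, and the class and function names are hypothetical.

```python
import threading


class Gate:
    """Allow at most max_concurrent callers inside the gate at once."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def __enter__(self):
        self._sem.acquire()  # blocks while the gate is full
        return self

    def __exit__(self, exc_type, exc, tb):
        self._sem.release()
        return False


def series(gate, samples, sample_limit=0):
    """Return samples, failing if a non-zero sample limit is exceeded."""
    with gate:
        # A limit of 0 means the sample limit is disabled.
        if sample_limit > 0 and len(samples) > sample_limit:
            raise ValueError("sample limit exceeded")
        return list(samples)


gate = Gate(max_concurrent=20)
ok = series(gate, [1, 2, 3], sample_limit=0)
try:
    series(gate, [1, 2, 3], sample_limit=2)
    dropped = False
except ValueError:
    dropped = True
```

Time spent waiting in `__enter__` corresponds to what a gate duration metric would measure, and a rejected call corresponds to a dropped query.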
`--block-sync-concurrency` flag, which allows you to configure number of goroutines to use when syncing block metadata from object storage.
`--store.response-timeout` flag. If a Store doesn’t send any data in this specified duration then a Store will be ignored and partial data will be returned if it’s enabled. 0 disables timeout.
`stable` maturity level.
`stable` maturity level.
`trace.enable: true` to enable the minio client’s verbose logging.
`--debug.accept-malformed-index`. Compaction index verification will ignore out of order label names.
`--compact.concurrency`. Number of goroutines to use when compacting groups.
`--query.default-evaluation-interval`, which sets default evaluation interval for sub queries.
`series` now supports POST method.
`query_range` now supports POST method.
`partial_response_disabled` proto field. Added `partial_response_strategy` instead. Both in gRPC and Query API.
No `PartialResponseStrategy` field for `RuleGroups` by default means `abort` strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.
Metrics:
* Added `thanos_rule_evaluation_with_warnings_total` to Ruler.
* DNS `thanos_ruler_query_apis*` are now `thanos_ruler_query_apis_*` for consistency.
* DNS `thanos_querier_store_apis*` are now `thanos_querier_store_apis_*` for consistency.
* Query Gate `thanos_bucket_store_series*` are now `thanos_bucket_store_series_*` for consistency.
* Most of the thanos ruler metrics related to the rule manager have a `strategy` label.
Ruler tracing spans:
* `/rule_instant_query HTTP[client]` is now `/rule_instant_query_part_resp_abort HTTP[client]` if request is for abort strategy.
`v0.4.0` to `v0.6.1`) deps.
Changes that affect Thanos:
* query:
* [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
* [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
* [BUGFIX] Fix panic when aggregator param is not a literal. #5290
* ruler:
* [ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
* [BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
* [BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our issue #1027
* [BUGFIX] Fix sorting of rule groups. #5260
* store: [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
* tooling: [FEATURE] New dump command to tsdb tool to dump all samples.
* compactor:
* [ENHANCEMENT] When closing the db any running compaction will be cancelled so it doesn’t block.
* [CHANGE] breaking Renamed flag `--sync-delay` to `--consistency-delay` #1053
For ruler essentially whole TSDB CHANGELOG applies between v0.4.0-v0.6.1: https://github.com/prometheus/tsdb/blob/master/CHANGELOG.md
Note that this was added on TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370 However, due to the nature of Thanos compaction (distributed systems), for safety reasons this is disabled for Thanos compactor for now.
`thanos_objstore_bucket_last_successful_upload_time` now does not appear when no blocks have been uploaded so far.
`0s`
⚠️ WARNING ⚠️ #873 fixes the actual handling of `index-cache-size`. Handling of the limit for this cache was broken, so it was unbounded all the time. From this release the actual value matters and is extremely low by default. To “revert” to the old behaviour (no boundary), use a large enough value.
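To see why an enforced limit changes behaviour, consider a size-bounded LRU cache: once the byte budget is reached, the least recently used entries are evicted, whereas an unbounded cache just keeps growing. The sketch below is a generic illustration under that assumption, not the Thanos index cache; the class name and sizes are hypothetical.

```python
from collections import OrderedDict


class SizeBoundedLRU:
    """LRU cache that evicts least recently used items once max_bytes is exceeded."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()

    def put(self, key, value):
        if len(value) > self.max_bytes:
            return False  # the item can never fit, so it is not cached
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = value
        self.used += len(value)
        while self.used > self.max_bytes:
            _, evicted = self._items.popitem(last=False)  # drop the oldest entry
            self.used -= len(evicted)
        return True

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]


cache = SizeBoundedLRU(max_bytes=10)
cache.put("a", b"12345")
cache.put("b", b"12345")
cache.get("a")            # "a" becomes the most recently used entry
cache.put("c", b"12345")  # the budget is exceeded, so "b" is evicted
```

With a very small budget, most entries are evicted soon after insertion, which is why a low default is worth reviewing against the size of your index data.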
`slice bounds out of range`.
`<>` `!=`.
`bucket inspect` command for better insights on blocks in object storage.
`--web.route-prefix`, `--web.external-prefix`, `--web.prefix-header`. Details here
`count_values` PromQL function. #4585
`block-sync-concurrency` with `20` default, so no change by default)
`put_user_metadata` option to config.
`insecure_skip_verify` option to config.
`/stores`.
Next Thanos release adding support to new discovery method, gRPC mTLS and two new object store providers (Swift and Azure).
Note lots of necessary breaking changes in flags that relates to bucket configuration.
`thanos_objstore_gcs_bucket_operations_total` in favor of generic bucket operation metrics.
`thanos_` prefix to memberlist (gossip) metrics. Make sure to update your dashboards and rules.
`"X-Amz-Acl": "bucket-owner-full-control"` metadata for s3 upload operation.
`--objstore.config-file` to reference the bucket configuration file in yaml format. Detailed information can be found in document storage.
`thanos rule`, static configuration of query nodes via `--query`
`thanos rule`, file based discovery of query nodes using `--query.file-sd-config.files`
`thanos query`, file based discovery of store nodes using `--store.file-sd-config.files`
`/-/healthy` endpoint to Querier.
`dns+` and `dnssrv+` prefixes for the respective lookup. Details here
`--cluster.disable` flag to disable gossip functionality completely.
`thanos_rule_loaded_rules` metric.
`thanos_compactor_retries_total` metric not being registered.
Initial version to have a stable reference before gossip protocol removal.