#7429: Reloader: introduce TolerateEnvVarExpansionErrors to allow suppressing errors when expanding environment variables in the configuration file. When set, this will ensure that the reloader won’t consider the operation to fail when an unset environment variable is encountered. Note that all unset environment variables are left as is, whereas all set environment variables are expanded as usual.
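The tolerant expansion behaviour described in #7429 can be illustrated with a minimal sketch. It is not the reloader's actual code: the `$(NAME)` placeholder syntax, the function name `expand_env` and the `tolerate_errors` parameter are assumptions made for this example.

```python
import re

# Assumed placeholder syntax for this sketch: $(NAME).
_PLACEHOLDER = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_env(text, env, tolerate_errors=False):
    """Expand $(NAME) placeholders in text using the env mapping.

    With tolerate_errors=True, placeholders for unset variables are left
    as is; otherwise an unset variable raises KeyError.
    """
    def substitute(match):
        name = match.group(1)
        if name in env:
            return env[name]
        if tolerate_errors:
            # Unset variable: keep the placeholder untouched.
            return match.group(0)
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(substitute, text)
```

With `env = {"REGION": "eu-west-1"}`, the tolerant mode expands `$(REGION)` and leaves an unset `$(ZONE)` in place, while the strict mode raises on `$(ZONE)`.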
#7560 Query: Added the possibility of filtering rules by rule_name, rule_group or file to the HTTP API.
#7652 Store: Implement metadata API limit in stores.
#7659 Receive: Add support for replication using Cap’n Proto. This protocol has a lower CPU and memory footprint, which leads to a reduction in resource usage in Receivers. Before enabling it, make sure that all receivers are updated to a version which supports this replication method.
#7853 UI: Add support for selecting graph time range with mouse drag.
#7855 Compact/Query: Add support for comma separated replica labels.
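As a rough illustration of what accepting comma separated replica labels means, the sketch below flattens repeated flag values into a single list. The function name is hypothetical, and trimming whitespace and dropping duplicates are assumptions of this example rather than documented behaviour.

```python
def parse_replica_labels(flag_values):
    """Flatten repeated flag values, each possibly comma separated.

    Whitespace trimming and de-duplication are assumptions of this sketch.
    """
    labels = []
    for value in flag_values:
        for part in value.split(","):
            part = part.strip()
            if part and part not in labels:
                labels.append(part)
    return labels
```

Under these assumptions, passing the flag once as `replica,prometheus_replica` gives the same result as passing it twice with one label each.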
#7654 *: Add --grpc-server-tls-min-version flag to allow the user to specify the minimum TLS version; otherwise it defaults to TLS 1.3.
#7317 Tracing: allow specifying resource attributes for the OTLP configuration.
#7367 Store Gateway: log request ID in request logs.
#7361 Query: breaking ⚠️ pass query stats from remote execution from server to client. We changed the protobuf of the QueryAPI. If you use query.mode=distributed, you need to update your client (upper level Queriers) first, before updating leaf Queriers (servers).
#7363 Query-frontend: set value of remote_user field in Slow Query Logs from HTTP header
#7335 Dependency: Update minio-go to v7.0.70 which includes support for EKS Pod Identity.
#7477 *: Bump objstore to 20240622095743-1afe5d4bc3cd
#7334 Compactor: do not vertically compact downsampled blocks. Such cases are now marked with no-compact-mark.json. Fixes the panic "panic: unexpected seriesToChunkEncoder lack of iterations".
#7393 *: breaking ⚠️ Using native histograms for grpc middleware metrics. Metrics grpc_client_handling_seconds and grpc_server_handling_seconds will now be native histograms. If you have enabled native histogram scraping, you will need to update your PromQL expressions to use the new metric names.
#7134 Store, Compact: Revert the recursive block listing mechanism introduced in https://github.com/thanos-io/thanos/pull/6474 and use the same strategy as in 0.31. Introduce a --block-discovery-strategy flag to control the listing strategy so that a recursive lister can still be used if the tradeoff of slower but cheaper discovery is preferred.
#7122 Store Gateway: Fix lazy expanded postings estimate base cardinality using posting group with remove keys.
#7166 Receive/MultiTSDB: Do not delete non-uploaded blocks
#7224 Query-frontend: Add Redis username to the client configuration.
#7220 Store Gateway: Fix lazy expanded postings caching partial expanded postings and bug of estimating remove postings with non existent value. Added PromQLSmith based fuzz test to improve correctness.
#7225 Compact: Don’t halt due to overlapping sources when vertical compaction is enabled
#7244 Query: Fix Internal Server Error unknown targetHealth: “unknown” when trying to open the targets page.
#7248 Receive: Fix RemoteWriteAsync was sequentially executed causing high latency in the ingestion path.
#7271 Query: fixing dedup iterator when working on mixed sample types.
#7289 Query Frontend: show warnings from downstream queries.
#7219 Receive: add --remote-write.client-tls-secure and --remote-write.client-tls-skip-verify flags to stop relying on grpc server config to determine grpc client secure/skipVerify.
#7297 *: mark as not queryable if status is not ready
#7302 Considering the X-Forwarded-For header for the remote address in the logs.
#6756 Query: Add query.enable-tenancy & query.tenant-label-name options to allow enforcement of tenancy on the query path, by injecting labels into queries (uses prom-label-proxy internally).
#6944 Receive: Added a new flag for maximum retention bytes.
#6891 Objstore: Bump objstore which adds support for Azure Workload Identity.
#6453 Sidecar: Added --reloader.method to support configuration reloads via SIGHUP signal.
#6925 Store Gateway: Support float native histogram.
#6954 Index Cache: Support tracing for fetch APIs.
#6943 Ruler: Added keep_firing_for field in alerting rule.
#6972 Store Gateway: Apply series limit when streaming series for series actually matched if lazy postings is enabled.
#6984 Store Gateway: Added --store.index-header-lazy-download-strategy to specify how to lazily download index headers when lazy mmap is enabled.
#6887 Query Frontend: breaking ⚠️ Add tenant label to relevant exported metrics. Note that this change may cause some pre-existing custom dashboard queries to be incorrect due to the added label.
#7028 Query|Query Frontend: Add new --query-frontend.enable-x-functions flag to enable experimental extended functions.
#6884 Tools: Add upload-block command to upload blocks to object storage.
#7010 Cache: Added set_async_circuit_breaker_* to utilize the circuit breaker pattern for dynamically thresholding asynchronous set operations.
#7014 *: breaking ⚠️ Removed experimental query pushdown feature to simplify query path. This feature has had high complexity for too little benefits. The responsibility for query pushdown will be moved to the distributed mode of the new ’thanos’ promql engine.
#6605 Query Frontend: Support vertical sharding binary expression with metric name when no matching labels specified.
#6308 Ruler: Support configuration flag that allows customizing template for alert message.
#6760 Query Frontend: Added TLS support in --query-frontend.downstream-tripper-config and --query-frontend.downstream-tripper-config-file
#7004 Query Frontend: Support documented auto discovery for memcached
#6749 Store Gateway: Added thanos_store_index_cache_fetch_duration_seconds histogram for tracking latency of fetching data from index cache.
#6690 Store: breaking ⚠️ Add tenant label to relevant exported metrics. Note that this change may cause some pre-existing dashboard queries to be incorrect due to the added label.
#6530 / #6690 Query: Add command line arguments for configuring tenants and forward tenant information to Store Gateway.
#6765 Index Cache: Add enabled_items to index cache config to selectively cache configured items. Available item types are Postings, Series and ExpandedPostings.
#6773 Index Cache: Add ttl to control the ttl to store items in remote index caches like memcached and redis.
#6794 Query: breaking ⚠️ Add tenant label to relevant exported metrics. Note that this change may cause some pre-existing custom dashboard queries to be incorrect due to the added label.
#6847 Store: Add thanos_bucket_store_indexheader_download_duration_seconds and thanos_bucket_store_indexheader_load_duration_seconds metrics for tracking latency of downloading and initializing the index-header.
#6698 Receive: Change write log level from warn to info.
#6753 mixin(Rule): breaking ⚠️ Fixed the mixin rules with duplicate names and updated the promtool version from v0.37.0 to v0.47.0
#6772 *: Bump prometheus to v0.47.2-0.20231006112807-a5a4eab679cc
#6794 Receive: the exported HTTP metrics now uses the specified default tenant for requests where no tenants are found.
#6651 *: Update go_grpc_middleware to v2.0.0. Remove Tags Interceptor from Thanos. Tags interceptor is removed from v2.0.0 go-grpc-middleware and is not needed anymore.
#6503 *: Change the engine behind ContentPathReloader to be completely independent of any filesystem concept. This effectively fixes this configuration reload when used with Kubernetes ConfigMaps, Secrets, or other volume mounts.
#6456 Store: fix crash when computing set matches from regex pattern
#6427 Receive: increased log level for failed uploads to error
#6172 query-frontend: return JSON formatted errors for invalid PromQL expression in the split by interval middleware.
#6049 Compact: breaking ⚠️ Replace group with resolution in compact metrics to avoid cardinality explosion on compact metrics for large numbers of groups.
#6168 Receiver: Make ketama hashring fail early when configured with number of nodes lower than the replication factor.
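The fail-early check from #6168 amounts to a validation at configuration time. A minimal sketch follows; the function and parameter names are hypothetical.

```python
def validate_hashring(endpoints, replication_factor):
    """Reject a hashring that has fewer nodes than the replication factor."""
    if len(endpoints) < replication_factor:
        raise ValueError(
            f"hashring has {len(endpoints)} node(s), "
            f"fewer than replication factor {replication_factor}"
        )
    return endpoints
```

Failing at startup surfaces the misconfiguration immediately instead of on the first write that cannot be replicated.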
#6201 Query-Frontend: Disable absent and absent_over_time for vertical sharding.
#6212 Query-Frontend: Disable scalar for vertical sharding.
#6107 breaking ⚠️ Change default user ID in container image from 0 (root) to 1001.
#6228 Conditionally generate debug messages in ProxyStore to avoid memory bloat.
#6231 mixins: Add code/grpc-code dimension to error widgets.
#6244 mixin(Rule): Add rule evaluation failures to the Rule dashboard.
#6303 Store: added and started using streamed snappy encoding for postings lists instead of the block based one. This leads to constant memory usage during decompression. This approximately halves memory usage when decompressing a postings list in index cache.
#6071 Query Frontend: breaking ⚠️ Add experimental native histogram support, for which we updated and aligned with the Prometheus common model. That model is used for caching, so a cache reset is required.
#6163 Receiver: changed default max backoff from 30s to 5s for forwarding requests. Can be configured with --receive-forward-max-backoff.
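To show what lowering the maximum backoff changes in practice, here is a sketch of a capped exponential backoff. The doubling schedule and the absence of jitter are simplifying assumptions of this example, and the names are hypothetical.

```python
def backoff_delays(min_backoff, max_backoff, attempts):
    """Yield retry delays in seconds that double each attempt, capped at max_backoff."""
    delay = min_backoff
    for _ in range(attempts):
        yield min(delay, max_backoff)
        delay *= 2
```

With a 1s minimum, a 5s cap yields delays of 1, 2, 4, 5, 5 seconds over five attempts, whereas a 30s cap would yield 1, 2, 4, 8, 16 seconds.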
#6327 *: breaking ⚠️ Use histograms instead of summaries for instrumented handlers.
#6322 Logging: Avoid expensive log.Valuer evaluation for disallowed levels.
#6358 Query: Add +Inf bucket to query duration metrics
#6363 Store: Check context error when expanding postings.
#6405 Index Cache: Change postings cache key to include the encoding format used so that older Thanos versions would not try to decode it during the deployment of a new version.
#6479 Store: breaking ⚠️ Rename thanos_bucket_store_cached_series_fetch_duration_seconds to thanos_bucket_store_series_fetch_duration_seconds and thanos_bucket_store_cached_postings_fetch_duration_seconds to thanos_bucket_store_postings_fetch_duration_seconds.
#6474 Store/Compact: Reduce a large amount of Exists API calls against object storage when synchronizing meta files in favour of a recursive Iter call.
#5990 Cache/Redis: add support for Redis Sentinel via new option master_name.
#6008 *: Add counter metric gate_queries_total to gate.
#5926 Receiver: Add experimental string interning in writer. Can be enabled with a hidden flag --writer.intern.
#5773 Store: Support disabling cache index header file by setting --no-cache-index-header. When toggled, Stores can run without needing persistent disks.
#5653 Receive: Allow setting hashing algorithm per tenant in hashrings config.
#6074 *: Add histogram metrics thanos_store_server_series_requested and thanos_store_server_chunks_requested to all Stores.
#6074 *: Allow configuring series and sample limits per Series request for all Stores.
NOTE: Querier’s query.promql-engine flag enabling new PromQL engine is now unhidden. We encourage users to use new experimental PromQL engine for efficiency reasons.
#5716 DNS: Fix miekgdns resolver LookupSRV to work with CNAME records.
#5844 Query Frontend: Fixes @ modifier time range when splitting queries by interval.
#5854 Query Frontend: lookback_delta param is now handled in query frontend.
#5860 Query: Fixed bug of not showing query warnings in Thanos UI.
#5856 Store: Fixed handling of debug logging flag.
#5230 Rule: Stateless ruler now supports restoring the "for" state of alerts from query API servers. The query API servers should be able to access the remote write storage.
#5880 Query Frontend: Fixes some edge cases of query sharding analysis.
#5893 Cache: Fixed redis client not respecting SetMultiBatchSize config value.
#5966 Query: Fixed mint and maxt when selecting series for the api/v1/series HTTP endpoint.
#5814 Store: Added metric thanos_bucket_store_postings_size_bytes that shows the distribution of how many postings (in bytes) were needed for each Series() call in Thanos Store. Useful for determining limits.
#5703 StoreAPI: Added hash field to series’ chunks. Store gateway and receive implements that field and proxy leverage that for quicker deduplication.
#5801 Store: Added a new flag --store.grpc.downloaded-bytes-limit that limits the number of bytes downloaded in each Series/LabelNames/LabelValues call. Use thanos_bucket_store_postings_size_bytes for determining the limits.
#5836 Receive: Added hidden flag tsdb.memory-snapshot-on-shutdown to enable experimental TSDB feature to snapshot on shutdown. This is intended to speed up receiver restart.
#5839 Receive: Added parameter --tsdb.out-of-order.time-window to set time window for experimental out-of-order samples ingestion. Disabled by default (set to 0s). Please note if you enable this option and you use compactor, make sure you set the --enable-vertical-compaction flag, otherwise you might risk compactor halt.
#5889 Query Frontend: Added support for vertical sharding label_replace and label_join functions.
#5819 Store: Added a few objectives for Store’s data summaries (touched/fetched amount and sizes). They are: 50, 95, and 99 quantiles.
#5837 Store: Added streaming retrieval of series from object storage.
#5940 Objstore: Support for authenticating to Swift using application credentials.
#5945 Tools: Added new no-downsample marker to skip blocks when downsampling via thanos tools bucket mark --marker=no-downsample-mark.json. This will skip downsampling for blocks with the new marker.
#5977 Tools: Added remove flag on bucket mark command to remove deletion, no-downsample or no-compact markers on the block
#5785 Query: thanos_store_nodes_grpc_connections now trims external_labels label names longer than 1000 characters. It also allows customizing which labels to preserve using the query.conn-metric.label flag.
#5542 Mixin: Added query concurrency panel to Querier dashboard.
#5593 Cache: switch Redis client to Rueidis. Rueidis is faster and provides client-side caching. It is highly recommended to use it so that repeated requests for the same key would not be needed.
#5896 *: Upgrade Prometheus to v0.40.7 without implementing native histogram support. Querying native histograms will fail with Error executing query: invalid chunk encoding "<unknown>" and native histograms in write requests are ignored.
#5909 Receive: Compact tenant head after no appends have happened for 1.5 tsdb.max-block-size.
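The idle condition in #5909 can be read as a simple time comparison. The sketch below assumes the configured maximum block size is expressed as a duration; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def should_compact_idle_head(last_append, now, max_block_duration):
    """Return True once no appends have happened for 1.5x the max block duration."""
    return now - last_append >= 1.5 * max_block_duration
```

For a two hour block duration, a tenant head becomes eligible for compaction three hours after its last append.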
#5838 Mixin: Added data touched type to Store dashboard.
#5922 Compact: Retry on clean, partial marked errors when possible.
#5799 Query Frontend: Fixed sharding behaviour for vector matches. Now queries with sharding should work properly where the query looks like: foo and without (lbl) bar.
#5565 Receive: Allow remote write request limits to be defined per file and tenant (experimental).
#5654 Query: add --grpc-compression flag that controls the compression used in gRPC client. With the flag it is now possible to compress the traffic between Query and StoreAPI nodes - you get lower network usage in exchange for a bit higher CPU/RAM usage.
#5650 Query Frontend: Add sharded queries metrics. thanos_frontend_sharding_middleware_queries_total shows how many queries were sharded or not sharded.
#5658 Query Frontend: Introduce new optional parameters (query-range.min-split-interval, query-range.max-split-interval, query-range.horizontal-shards) to implement more dynamic horizontal query splitting.
#5721 Store: Add metric thanos_bucket_store_empty_postings_total for number of empty postings when fetching series.
#5255 Query: Use k-way merging for the proxying logic. The proxying sub-system now uses much less resources (~25-80% less CPU usage, ~30-50% less RAM usage according to our benchmarks). Reduces query duration by a few percent on queries with lots of series.
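For readers unfamiliar with k-way merging, the sketch below merges several already-sorted streams lazily using Python's standard `heapq.merge`. It only illustrates the general technique; the function name is hypothetical and nothing here reflects the actual proxy code.

```python
import heapq


def merge_sorted_series(*streams):
    """Lazily merge already-sorted streams into one sorted stream.

    heapq.merge keeps one pending element per stream on a heap, so memory
    use grows with the number of streams rather than their total length.
    """
    return heapq.merge(*streams)
```

For example, merging `["a", "d"]`, `["b", "e"]` and `["c"]` yields `a, b, c, d, e` without materializing any input stream in full.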
#5690 Compact: update --debug.accept-malformed-index flag to apply to downsampling. Previously the flag only applied to compaction, and fatal errors would still occur when downsampling was attempted.
#5707 Objstore: Update objstore to latest version which includes a refactored Azure Storage Account implementation with a new SDK.
#5641 Store: Remove hardcoded labels in shard matcher.
#5641 Query: Inject unshardable le label in query analyzer.
#5685 Receive: Make active/head series limiting configuration per tenant by adding it to new limiting config.
#5411 Tracing: Change Jaeger exporter from OpenTracing to OpenTelemetry. Options RPC Metrics, Gen128Bit and Disabled are now deprecated and won’t have any effect when set ⚠️.
#5573 Sidecar: Added --prometheus.get_config_interval and --prometheus.get_config_timeout allowing to configure parameters for getting Prometheus config.
#5440 HTTP metrics: export number of in-flight HTTP requests.
#5424 Receive: Export metrics regarding size of remote write requests.
#5420 Receive: Automatically remove stale tenants.
#5472 Receive: Add new tenant metrics to example dashboard.
#5475 Compact/Store: Added --block-files-concurrency allowing to configure number of go routines for downloading and uploading block files during compaction.
#5470 Receive: Expose TSDB stats as metrics for all tenants.
#5493 Compact: Added --compact.blocks-fetch-concurrency allowing to configure number of goroutines for downloading blocks during compactions.
#5480 Query: Expose endpoint info timeout as a hidden flag --endpoint.info-timeout.
#5527 Receive: Add per request limits for remote write. Added four new hidden flags --receive.write-request-limits.max-size-bytes, --receive.write-request-limits.max-series, --receive.write-request-limits.max-samples and --receive.write-request-limits.max-concurrency for limiting requests max body size, max amount of series, max amount of samples and max amount of concurrent requests.
#5520 Receive: Meta-monitoring based active series limiting (experimental). This mode is only available if Receiver is in Router or RouterIngestor mode, and config is provided. Added four new hidden flags receive.tenant-limits.max-head-series for the max active series for the tenant, receive.tenant-limits.meta-monitoring-url for the Meta-monitoring URL, receive.tenant-limits.meta-monitoring-query for specifying the PromQL query to execute and receive.tenant-limits.meta-monitoring-client for specifying HTTP client configs.
#5555 Query: Added --query.active-query-path flag, allowing the user to configure the directory to create an active query tracking file, queries.active, for different resolution.
#5566 Receive: Added experimental support to enable chunk write queue via --tsdb.write-queue-size flag.
#5575 Receive: Add support for gRPC compression with snappy.
#4838 Tracing: Changed client for Stackdriver, which deprecated “type: STACKDRIVER” in tracing YAML configuration. Use type: GOOGLE_CLOUD instead (STACKDRIVER type remains for backward compatibility).
#5170 All: Upgraded the TLS version from TLS1.2 to TLS1.3.
#5205 Rule: Add ruler labels as external labels in stateless ruler mode.
#5206 Cache: Add timeout for groupcache’s fetch operation.
#5218 Tools: Thanos tools bucket downsample is now running continuously.
#5231 Tools: Bucket verify tool ignores blocks with deletion markers.
#5244 Query: Promote negative offset and @ modifier to stable features as per Prometheus #10121.
#5255 InfoAPI: Set store API unavailable when stores are not ready.
#5153 Receive: option to extract tenant from client certificate
#5110 Block: Do not upload DebugMeta files to obj store.
#4963 Compactor, Store, Tools: Loading block metadata now only filters out duplicates within a source (or compaction group if replica labels are configured), and does so in parallel over sources.
#5089 S3: Create an empty map in the case SSE-KMS is used and no KMSEncryptionContext is passed.
#4970 Tools tools bucket ls: Added a new flag exclude-delete to exclude blocks marked for deletion.
#4903 Compactor: Added tracing support for compaction.
#4909 Compactor: Add flag --max-time / --min-time to filter blocks that are ready to be compacted.
#4942 Tracing: add traceid_128bit support for jaeger.
#4917 Query: add initial query pushdown for a subset of aggregations. Can be enabled with --enable-feature=query-pushdown on Thanos Query.
#4612 Sidecar: add --prometheus.http-client and --prometheus.http-client-file flag for sidecar to connect Prometheus with basic auth or TLS.
#4847 Query: add --alert.query-url which is used in the user interface for rules/alerts pages. By default the HTTP listen address is used for this URL.
#4848 Compactor: added Prometheus metric for tracking the progress of retention.
#4874 Query: Add --endpoint-strict flag to statically configure Thanos API server endpoints. It is similar to --store-strict but supports passing any Thanos gRPC APIs: StoreAPI, MetadataAPI, RulesAPI, TargetsAPI and ExemplarsAPI.
#4868 Rule: Support ruleGroup limit introduced by Prometheus v2.31.0.
#4897 Query: Add validation for querier address flags.
#4714 EndpointSet: Do not use unimplemented yet new InfoAPI to obtain metadata (avoids unnecessary HTTP roundtrip, instrumentation/alerts spam and logs).
#4594 Reloader: Expose metrics in config reloader to give info on the last operation.
#4619 Tracing: Added consistent tags to Series call from Querier about a number of important series statistics: processed.series, processed.samples, processed.samples and processed.bytes. This will give admins an idea of how much data each component processes per query.
#4623 Query-frontend: Make HTTP downstream tripper (client) configurable via parameters --query-range.downstream-tripper-config and --query-range.downstream-tripper-config-file. If your downstream URL is localhost or 127.0.0.1 then it is strongly recommended to bump max_idle_conns_per_host to at least 100 so that query-frontend could properly use HTTP keep-alive connections and thus reduce the latency of query-frontend by about 20%.
#4519 Query: Switch to miekgdns DNS resolver as the default one.
#4586 Update Prometheus/Cortex dependencies and implement LabelNames() pushdown as a result; provides massive speed-up for the labels API in Thanos Query.
#4421 breaking ⚠️: --store (in the future, to be renamed to --endpoints) now supports passing any APIs from Thanos gRPC APIs: StoreAPI, MetadataAPI, RulesAPI, TargetsAPI and ExemplarsAPI (as opposed to the past, when you had to put them in hidden --targets, --rules etc. flags). --store will now automatically detect which APIs the server exposes.
#4327 Add environment variable substitution to all YAML configuration flags.
#4239 Add penalty based deduplication mode for compactor.
#4292 Receive: Enable exemplars ingestion and querying.
#4392 Tools: Added --delete-blocks to bucket rewrite tool to mark the original blocks for deletion after rewriting is done.
#3970 Azure: Adds more configuration options for Azure blob storage. This allows for pipeline and reader specific configuration. Implements HTTP transport configuration options. These options allow for more fine-grained control on timeouts and retries. Implements MSI authentication as second method of authentication via a service principal token.
#4406 Tools: Add retention command for applying retention policy on the bucket.
#4430 Compact: Add flag downsample.concurrency to specify the concurrency of downsampling blocks.
#4384 Fix the experimental PromQL editor when used on multiple lines.
#4342 ThanosSidecarUnhealthy doesn’t fire if the sidecar is never healthy
#4388 Receive: fix bug in forwarding remote-write requests within the hashring via gRPC when TLS is enabled on the HTTP server but not on the gRPC server.
#4442 Ruler: fix SIGHUP reload signal not working.
#4354 Receive: use the S2 library for decoding Snappy data; saves about 5-7% of CPU time in the Receive component when handling incoming remote write requests
#4117 Mixin: new alert ThanosReceiveTrafficBelowThreshold to flag if the ingestion average of the last hour dips below 50% of the ingestion average for the last 12 hours.
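The alert condition in #4117 compares two averages. As a plain-arithmetic sketch (not the mixin's PromQL), assume one ingestion-rate sample per minute, ordered oldest to newest; the function name and the per-minute sampling are assumptions of this example.

```python
from statistics import mean


def traffic_below_threshold(rates_per_minute, ratio=0.5):
    """Return True if the last hour's average is below ratio x the 12 hour average.

    Assumes one sample per minute, ordered oldest to newest.
    """
    if not rates_per_minute:
        return False
    last_12h = rates_per_minute[-720:]
    last_1h = rates_per_minute[-60:]
    return mean(last_1h) < ratio * mean(last_12h)
```

With eleven hours at 100 samples/s followed by one hour at 10 samples/s, the 12 hour average is 92.5 and the last hour average is 10, so the condition holds.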
#4107 Store: LabelNames and LabelValues now support label matchers.
#3940 Sidecar: Added matchers support to LabelValues
#4171 Docker: Busybox image updated to latest (1.33.1)
#4175 Added tag configuration support for Lightstep tracing.
#4176 Query API: Adds optional Stats param to return stats for query APIs
#4125 Rule: Add --alert.relabel-config / --alert.relabel-config-file allowing to specify alert relabel configurations like Prometheus
#4211 Add TLS and basic authentication to Thanos APIs
#3700 Compact/Web: Make old bucket viewer UI work with vanilla Prometheus blocks.
#3657 *: It’s now possible to configure HTTP transport options for S3 client.
#3752 Compact/Store: Added --block-meta-fetch-concurrency allowing to configure number of go routines for block metadata synchronization.
#3723 Query Frontend: Added --query-range.request-downsampled flag enabling additional queries for downsampled data in case of empty or incomplete response to range request.
#3579 Cache: Added inmemory cache for caching bucket.
#3792 Receiver: Added --tsdb.allow-overlapping-blocks flag to allow overlapping tsdb blocks and enable vertical compaction.
#3740 Query: Added --query.default-step flag to set default step. Useful when your tenant scrape interval is stable and far from default UI’s 1s.
#3686 Query/Sidecar: Added metric metadata API support. You can now configure your Querier to fetch Prometheus metrics metadata from leaf Prometheus-es!
#3031 Compact/Sidecar/Receive/Rule: Added --hash-func. If some function has been specified, writers calculate hashes using that function of each file in a block before uploading them. If those hashes exist in the meta.json file then Compact does not download the files if they already exist on disk and with the same hash. This also means that the data directory passed to Thanos Compact is only cleared once at boot or if everything succeeds. So, if you, for example, use persistent volumes on k8s and your Thanos Compact crashes or fails to make an iteration properly then the last downloaded files are not wiped from the disk. The directories that were created the last time are only wiped again after a successful iteration or if the previously picked up blocks have disappeared.
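The download-skipping decision described in #3031 can be sketched as a hash comparison against a file already on disk. SHA-256 is used here as an example hash function; the helper names and the idea of passing the expected hash as a hex string are assumptions of this sketch, not the meta.json schema.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path, expected_hash):
    """Return False only if the file exists locally with the expected hash."""
    path = Path(local_path)
    if not path.is_file():
        return True
    return sha256_of(path) != expected_hash
```

A file left over from a crashed iteration is reused when its hash matches and is fetched again otherwise.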
#3705 Store: Fix race condition leading to failing queries or possibly incorrect query results.
#3661 Compact: Deletion-mark.json is deleted as the last one, which could in theory lead to potential store gateway load or query error for such in-deletion block.
#3760 Store: Fix panic caused by a race condition happening on concurrent index-header reader usage and unload, when --store.enable-index-header-lazy-reader is enabled.
#3759 Store: Fix panic caused by a race condition happening on concurrent index-header lazy load and unload, when --store.enable-index-header-lazy-reader is enabled.
#3773 Compact: Fixed compaction planner size check, making sure we don’t create too large blocks.
#3814 Store: Decreased memory utilisation while fetching block’s chunks.
#3815 Receive: Improve handling of empty time series from clients
#3795 s3: A truncated “get object” response is reported as error.
#3899 Receive: Correct the inference of client gRPC configuration.
#3943 Receive: Fixed memory regression introduced in v0.17.0.
#3960 Query: Fixed deduplication of equal alerts with different labels.
#3380 Mixin: Add block deletion panels for compactor dashboards.
#3568 Store: Optimized inject label stage of index lookup.
#3566 StoreAPI: Support label matchers in labels API.
#3531 Store: Optimized common cases for time selecting smaller amount of series by avoiding looking up symbols.
#3469 StoreAPI: Added hints field to LabelNamesRequest and LabelValuesRequest. Hints are an opaque data structure that can be used to carry additional information from the store and its content is implementation-specific.
#3421 Tools: Added thanos tools bucket rewrite command allowing to delete series from given block.
#3509 Store: Added a CLI flag to limit the number of series that are touched.
#3444 Query Frontend: Make POST request to downstream URL for labels and series API endpoints.
#3388 Tools: Bucket replicator now can specify block IDs to copy.
#3385 Tools: Bucket prints extra statistics for block index with debug log-level.
#3121 Receive: Added --receive.hashrings alternative to receive.hashrings-file flag (lower priority). The flag expects the literal hashring configuration in JSON format.
#3496 S3: Respect SignatureV2 flag for all credential providers.
#2732 Swift: Switched to a new library ncw/swift providing large objects support. By default, segments will be uploaded to the same container directory segments/ if the file is bigger than 1GB. To change the defaults see the docs.
#3626 Shipper: Failed upload of meta.json file doesn’t cause block cleanup anymore. This has a potential to generate corrupted blocks under specific conditions. Partial block is left in bucket for later cleanup.
#3532 compact: do not cleanup blocks on boot. Reverts the behavior change introduced in #3115 as in some very bad cases the boot of Thanos Compact took a very long time since there were a lot of blocks-to-be-cleaned.
#3520 Fix index out of bound bug when comparing ZLabelSets.
#3259 Thanos BlockViewer: Added a button in the blockviewer that allows users to download the metadata of a block.
#3261 Thanos Store: Use segment files specified in meta.json file, if present. If not present, Store does the LIST operation as before.
#3276 Query Frontend: Support query splitting and retry for label names, label values and series requests.
#3315 Query Frontend: Support results caching for label names, label values and series requests.
#3346 Ruler UI: Fix a bug preventing the /rules endpoint from loading.
#3115 compact: now deletes partially uploaded blocks and blocks with deletion marks concurrently. It does that at the beginning and then every --compact.cleanup-interval time period. By default it is 5 minutes.
#3312 s3: add list_objects_version config option for compatibility.
#3356 Query Frontend: Add a flag to disable step alignment middleware for query range.
#3378 Ruler: added the ability to send queries via the HTTP method POST. Helps when alerting/recording rules are extra long because it encodes the actual parameters inside of the body instead of the URI. Thanos Ruler now uses POST by default unless --query.http-method is set to GET.
#3381 Querier UI: Add ability to enable or disable metric autocomplete functionality.
#2979 Replicator: Add the ability to replicate blocks within a time frame by passing --min-time and --max-time
#3277 Thanos Query: Introduce dynamic lookback interval. This allows queries with large step to make use of downsampled data.
#3409 Compactor: Added support for no-compact-mark.json which excludes the block from compaction.
#3245 Query Frontend: Add query-frontend.org-id-header flag to specify HTTP header(s) to populate slow query log (e.g. X-Grafana-User).
#3431 Store: Added experimental support to lazy load index-headers at query time. When enabled via --store.enable-index-header-lazy-reader flag, the store-gateway will load into memory an index-header only once it’s required at query time. Index-header will be automatically released after --store.index-header-lazy-reader-idle-timeout of inactivity.
This, generally, reduces baseline memory usage of store when inactive, as well as the total number of mapped files (which is limited to 64k in some systems).
#3437 StoreAPI: Added hints field to LabelNamesResponse and LabelValuesResponse. Hints are an opaque data structure that can be used to carry additional information from the store and its content is implementation specific.
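The lazy loading and idle release behaviour can be sketched as a small wrapper that loads on first use and drops the loaded value after a period of inactivity. This is an illustration only: the class name, the `loader` callable and the injectable `clock` are hypothetical, and the real store-gateway works with memory-mapped index-header files rather than Python objects.

```python
import time


class LazyIndexHeader:
    """Load a header on first use and release it after an idle timeout."""

    def __init__(self, loader, idle_timeout, clock=time.monotonic):
        self._loader = loader            # callable that loads the header
        self._idle_timeout = idle_timeout  # seconds of inactivity allowed
        self._clock = clock
        self._header = None
        self._last_used = None

    @property
    def loaded(self):
        return self._header is not None

    def get(self):
        """Return the header, loading it if needed, and record the access time."""
        if self._header is None:
            self._header = self._loader()
        self._last_used = self._clock()
        return self._header

    def unload_if_idle(self):
        """Release the header if it has been unused for at least the idle timeout."""
        if self._header is None:
            return False
        if self._clock() - self._last_used >= self._idle_timeout:
            self._header = None
            return True
        return False
```

A background loop would call `unload_if_idle()` periodically; the next `get()` after a release simply loads the header again.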
#3415 Tools: Added thanos tools bucket mark command that allows to mark given block for deletion or for no-compact
#3452 Store: Index cache posting compression is now enabled by default. Removed experimental.enable-index-cache-postings-compression flag.
#3410 Compactor: Changed metric thanos_compactor_blocks_marked_for_deletion_total to thanos_compactor_blocks_marked_total with marker label. Compactor will now automatically disable compaction for blocks with large index that would output blocks after compaction larger than specified value (by default: 64GB). This automatically handles the Prometheus format limit.
#2906 Tools: Refactor Bucket replicate execution. Removed all thanos_replicate_origin_.* metrics.
thanos_replicate_origin_meta_loads_total can be replaced by blocks_meta_synced{state="loaded"}.
thanos_replicate_origin_partial_meta_reads_total can be replaced by blocks_meta_synced{state="failed"}.
#3309 Compact: breaking ⚠️ Rename metrics to match naming convention. This includes metrics starting with thanos_compactor to thanos_compact, thanos_querier to thanos_query and thanos_ruler to thanos_rule.
New Thanos component, Query Frontend has more options and supports shared cache (currently: Memcached).
Added debug mode in Thanos UI that allows filtering which Stores to query by their IPs from the Store page (!). This helps enormously in e.g. debugging the slowest store. All raw Thanos APIs allow passing storeMatch[] arguments with __address__ matchers.
#3147 Querier: Added query.metadata.default-time-range flag to specify the default metadata time range duration for retrieving labels through Labels and Series API when the range parameters are not specified. The zero value means range covers the time since the beginning.
#3207 Query Frontend: Added cache-compression-type flag to use compression in the query frontend cache.
#3122 *: All Thanos components now have a /debug/fgprof endpoint on the HTTP port, which allows getting off-CPU profiles as well.
#3109 Query Frontend: Added support for Cache-Control HTTP response header which controls caching behaviour. So far no-store value is supported and it makes the response skip cache.
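The Cache-Control behaviour described in #3109 can be illustrated with a minimal sketch. The function name `should_cache` and the dict-based header representation are hypothetical and not part of the Query Frontend code; only the `no-store` directive comes from the entry above.

```python
def should_cache(response_headers):
    """Return False when the response asks to bypass the cache via no-store."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value for name, value in response_headers.items()}
    # Cache-Control carries a comma-separated list of directives.
    directives = [
        directive.strip().lower()
        for directive in normalised.get("cache-control", "").split(",")
    ]
    return "no-store" not in directives
```

Any other directive is ignored here, which matches the statement that only no-store is supported so far.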
#3092 Tools: Added tools bucket cleanup CLI tool that deletes all blocks marked to be deleted.
#3154 Store: breaking Renamed metric thanos_bucket_store_queries_concurrent_max to thanos_bucket_store_series_gate_queries_max.
#3179 Store: context.Canceled will not increase thanos_objstore_bucket_operation_failures_total.
#3136 Sidecar: Improved detection of directory changes for Prometheus config.
breaking Added metric thanos_sidecar_reloader_config_apply_operations_total and rename metric thanos_sidecar_reloader_config_apply_errors_total to thanos_sidecar_reloader_config_apply_operations_failed_total.
#3022 *: Thanos images are now built with Go 1.15.
#2665 Swift: Fix issue with missing Content-Type HTTP headers.
#2800 Query: Fix handling of --web.external-prefix and --web.route-prefix.
#2834 Query: Fix rendered JSON state value for rules and alerts should be in lowercase.
#2866 Receive, Querier: Fixed leaks on receive and querier Store API Series, which were leaking on errors.
#2937 Receive: Fixing auto-configuration of --receive.local-endpoint.
#2895 Compact: Fix increment of thanos_compact_downsample_total metric for downsample of 5m resolution blocks.
#2858 Store: Fix --store.grpc.series-sample-limit implementation. The limit is now applied to the sum of all samples fetched across all queried blocks via a single Series call, instead of applying it individually to each block.
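To make the difference in #2858 concrete, here is a small sketch comparing the old per-block check with the new check over the sum of all blocks. Both function names are hypothetical, and treating the limit as inclusive is an assumption; the meaning of 0 as "disabled" follows the flag's documented default.

```python
def within_limit_per_block(samples_per_block, limit):
    """Old behaviour: every block is compared against the limit on its own."""
    if limit == 0:  # 0 disables the limit
        return True
    return all(samples <= limit for samples in samples_per_block)


def within_limit_total(samples_per_block, limit):
    """New behaviour: the samples of all queried blocks are summed first."""
    if limit == 0:  # 0 disables the limit
        return True
    # Assumption: the limit is inclusive.
    return sum(samples_per_block) <= limit
```

For example, a Series call touching two blocks with 600 samples each passes the old check with a limit of 1000 but fails the new one.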
#2936 Compact: Fix ReplicaLabelRemover panic when replicaLabels are not specified.
#2956 Store: Fix fetching of chunks bigger than 16000 bytes.
#2970 Store: Upgrade minio-go/v7 to fix slowness when running on EKS.
#2957 Rule: breaking ⚠️ Now sets all of the relevant fields properly; avoids a panic when /api/v1/rules is called and the time zone is not UTC; rules field is an empty array now if no rules have been defined in a rule group. Thanos Rule’s /api/v1/rules endpoint no longer returns the old, deprecated partial_response_strategy. The old, deprecated value has been fixed to WARN for quite some time. Please use partialResponseStrategy.
#2976 Query: Better rounding for incoming query timestamps.
#2929 Mixin: Fix expression for ‘unhealthy sidecar’ alert and increase the timeout to 10 minutes.
#3024 Query: Consider group name and file for deduplication.
#3012 Ruler,Receiver: Fix TSDB to delete blocks in atomic way.
#3046 Ruler,Receiver: Fixed framing of StoreAPI response, it was one chunk by one.
#3095 Ruler: Update the manager when all rule files are removed.
#3105 Querier: Fix overwriting maxSourceResolution when auto downsampling is enabled.
#3010 Querier: Added --query.lookback-delta flag to override the default lookback delta in PromQL. The lookback delta should be set to at least 2 times the slowest scrape interval. If unset, it will use the PromQL default of 5m.
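As a quick worked example of the "2 times the slowest scrape interval" guidance, the following sketch derives a flag value from a list of scrape intervals. The helper names are hypothetical; only the flag name and the factor of 2 come from the entry above.

```python
from datetime import timedelta


def min_lookback_delta(scrape_intervals):
    """Return the smallest recommended delta: twice the slowest scrape interval."""
    return 2 * max(scrape_intervals)


def lookback_flag(scrape_intervals):
    """Render the flag with the duration expressed in whole seconds."""
    seconds = int(min_lookback_delta(scrape_intervals).total_seconds())
    return f"--query.lookback-delta={seconds}s"
```

With scrape intervals of 30s and 2m, the slowest interval is 2m, so the recommended minimum is 4m (240s).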
#2926 API: Add new blocks HTTP API to serve blocks metadata. The status endpoints (/api/v1/status/flags, /api/v1/status/runtimeinfo and /api/v1/status/buildinfo) are now available on all components with a HTTP API.
#2892 Receive: Receiver fails when the initial upload fails.
#2980 Bucket Viewer: Migrate block viewer to React.
#2725 Add bucket index operation durations: thanos_bucket_store_cached_series_fetch_duration_seconds and thanos_bucket_store_cached_postings_fetch_duration_seconds.
#2931 Query: Allow passing a storeMatch[] to select matching stores when debugging the querier. See documentation.
#2893 Store: Rename metric thanos_bucket_store_cached_postings_compression_time_seconds to thanos_bucket_store_cached_postings_compression_time_seconds_total.
#2915 Receive,Ruler: Enable TSDB directory locking by default. Add a new flag (--tsdb.no-lockfile) to override behavior.
#2902 Querier UI: Separate dedupe and partial response checkboxes per panel in new UI.
#2991 Store: breaking ⚠️ operation label value getrange changed to get_range for thanos_store_bucket_cache_operation_requests_total and thanos_store_bucket_cache_operation_hits_total to be consistent with bucket operation metrics.
#2876 Receive,Ruler: Updated TSDB and switched to ChunkIterators instead of sample one, which avoids unnecessary decoding / encoding.
#3064 s3: breaking ⚠️ Add SSE/SSE-KMS/SSE-C configuration. The S3 encrypt_sse: true option is now deprecated in favour of sse_config. If you used encrypt_sse, the migration strategy is to set up the following block:
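A minimal sketch of that migration, written as a transformation over the S3 configuration loaded into a dict. It assumes that `type: SSE-S3` is the sse_config value equivalent to the old `encrypt_sse: true`; the function name `migrate_sse` is hypothetical.

```python
import copy


def migrate_sse(s3_config):
    """Return a copy of an S3 config dict with encrypt_sse replaced by sse_config."""
    migrated = copy.deepcopy(s3_config)
    # pop() drops the deprecated key whether it was true or false.
    if migrated.pop("encrypt_sse", False):
        # Assumption: SSE-S3 matches the behaviour of the old boolean option.
        migrated.setdefault("sse_config", {"type": "SSE-S3"})
    return migrated
```

Rendered back to YAML, the migrated configuration would then contain an `sse_config` mapping with a single `type` key instead of `encrypt_sse`.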
#2548 Query: Fixed rare cases of double counter reset accounting when querying rate with deduplication enabled.
#2536 S3: Fixed AWS STS endpoint url to https for Web Identity providers on AWS EKS.
#2501 Query: Gracefully handle additional fields in SeriesResponse protobuf message that may be added in the future.
#2568 Query: Don’t close the connection of strict, static nodes if establishing a connection had succeeded but Info() call failed.
#2615 Rule: Fix bugs where rules were out of sync.
#2614 Tracing: Disabled Elastic APM Go Agent default tracer on initialization to disable the default metric gatherer.
#2525 Query: Fixed logging for dns resolution error in the Query component.
#2484 Query/Ruler: Fixed issue #2483, when web.route-prefix is set, it is added twice in HTTP router prefix.
#2416 Bucket: Fixed issue #2416 where inspect --sort-by did not work correctly in all cases.
#2719 Query: irate and resets now use counter downsampling aggregations.
#2705 minio-go: Added support for af-south-1 and eu-south-1 regions.
#2753 Sidecar, Receive, Rule: Fixed possibility of out of order uploads in error cases. This could potentially cause Compactor to create overlapping blocks.
#2012 Receive: Added multi-tenancy support (based on header)
#2502 StoreAPI: Added hints field to SeriesResponse. Hints is an opaque data structure that can be used to carry additional information from the store; its content is implementation specific.
#2521 Sidecar: Added thanos_sidecar_reloader_reloads_failed_total, thanos_sidecar_reloader_reloads_total, thanos_sidecar_reloader_watch_errors_total, thanos_sidecar_reloader_watch_events_total and thanos_sidecar_reloader_watches metrics.
#2412 UI: Added React UI from Prometheus upstream. Currently only accessible from Query component as only /graph endpoint is migrated.
#2532 Store: Added hidden option --store.caching-bucket.config=<yaml content> (or --store.caching-bucket.config-file=<file.yaml>) for experimental caching bucket, that can cache chunks into shared memcached. This can speed up querying and reduce number of requests to object storage.
#2579 Store: Experimental caching bucket can now cache metadata as well. Config has changed from #2532.
#2526 Compact: In case there are no labels left after deduplication via --deduplication.replica-label, assign first replica-label with value deduped.
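The fallback described in #2526 can be sketched as follows. The function name and the dict-based label representation are hypothetical; the `deduped` value and the choice of the first replica label come from the entry above.

```python
def drop_replica_labels(labels, replica_labels):
    """Remove replica labels; if nothing is left, keep the first one as 'deduped'."""
    remaining = {
        name: value for name, value in labels.items() if name not in replica_labels
    }
    if not remaining and replica_labels:
        # An empty label set would be ambiguous, so a placeholder is assigned.
        remaining[replica_labels[0]] = "deduped"
    return remaining
```

A block labelled only with `replica="a"` therefore ends up with `replica="deduped"`, while a block that also carries other external labels simply loses the replica label.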
#2621 Receive: Added flag to configure forward request timeout. Receive write will complete request as soon as quorum of writes succeeds.
#2513 Tools: Moved thanos bucket commands to thanos tools bucket, also moved thanos check rules to thanos tools rules-check. thanos tools rules-check also takes rules by --rules repeated flag not argument anymore.
#2548 Store, Querier: remove duplicated chunks on StoreAPI.
Receive,Rule: TSDB now supports isolation of append and queries.
Receive,Rule: TSDB now holds fewer WAL files after Head Truncation.
#2450 Store: Added Regex-set optimization for label=~"a|b|c" matchers.
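The idea behind the regex-set optimization can be sketched like this: when a regex matcher is a plain alternation of literals, it can be answered by set membership instead of running the regex engine. This is an illustration of the technique under simplified assumptions (only letters, digits, `_` and `-` count as literal characters), not the Store implementation; both function names are hypothetical.

```python
import re

# Assumption: only these characters are treated as safe literal characters.
_LITERAL = re.compile(r"[A-Za-z0-9_\-]+")


def as_literal_set(pattern):
    """Return the set of literals if pattern is a plain alternation, else None."""
    parts = pattern.split("|")
    if all(_LITERAL.fullmatch(part) for part in parts):
        return set(parts)
    return None


def matches(pattern, value):
    """Match value against an anchored regex, using a set lookup when possible."""
    literals = as_literal_set(pattern)
    if literals is not None:
        return value in literals
    # Fall back to a fully anchored regex match, as label matchers are anchored.
    return re.fullmatch(pattern, value) is not None
```

For `label=~"a|b|c"` the matcher reduces to a lookup in `{"a", "b", "c"}`, while a pattern such as `a.*|b` still goes through the regex path.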
#2603 Store/Querier: Significantly optimize cases where StoreAPIs or blocks return exactly overlapping chunks (e.g. Store GW and sidecar or brute force Store Gateway HA).
#2252 Query: add new --store-strict flag. More information available here.
#2265 Compact: add --wait-interval to specify compaction wait interval between consecutive compact runs when --wait is enabled.
#2250 Compact: enable vertical compaction for offline deduplication (experimental). Uses --deduplication.replica-label flag to specify the replica label on which to deduplicate (hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples like those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
#1714 Compact: the compact component now exposes the bucket web UI when it is run as a long-lived process.
#2304 Store: added max_item_size configuration option to memcached-based index cache. This should be set to the max item size configured in memcached (-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
#2297 Store: add --experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into the cache. Compressed postings take about 10% of the original size.
#2357 Compact and Store: the compact and store components now serve the bucket UI on :<http-port>/loaded, which shows exactly the blocks that are currently seen by compactor and the store gateway. The compactor also serves a different bucket UI on :<http-port>/global, which shows the status of object storage without any filters.
#2172 Store: add support for sharding the store component based on the label hash.
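Sharding by label hash can be pictured with the sketch below: each store instance keeps only the blocks whose label value hashes to its shard number. The use of MD5 with the last 8 bytes is an assumption for illustration (any stable hash would work), and the function names are hypothetical.

```python
import hashlib


def shard_for(label_value, total_shards):
    """Map a label value to a shard number in [0, total_shards)."""
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % total_shards


def keep_block(label_value, shard, total_shards):
    """Return True if this store instance (shard) should serve the block."""
    return shard_for(label_value, total_shards) == shard
```

Because the hash is deterministic, every block is served by exactly one of the `total_shards` instances.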
#2113 Bucket: added thanos bucket replicate command to replicate blocks from one bucket to another.
#1922 Docs: create a new document to explain sharding in Thanos.
#2136 breaking Store, Compact, Bucket: schedule block deletion by adding deletion-mark.json. This adds a consistent way for multiple readers and writers to access object storage. Since there are no consistency guarantees provided by some Object Storage providers, this PR adds a consistent lock-free way of dealing with Object Storage irrespective of the choice of object storage. In order to achieve this co-ordination, blocks are not deleted directly. Instead, blocks are marked for deletion by uploading the deletion-mark.json file for the block that was chosen to be deleted. This file contains Unix time of when the block was marked for deletion. If you want to keep existing behavior, you should add --delete-delay=0s as a flag.
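The interplay between the deletion mark and --delete-delay can be sketched as below. The JSON field name `deletion_time` is an assumption about the file layout, and the function name is hypothetical; the entry above only states that the file contains the Unix time of the marking.

```python
import json


def is_deletable(mark_json, now_unix, delete_delay_seconds):
    """Return True once the delete delay has elapsed since the block was marked."""
    mark = json.loads(mark_json)
    # Assumption: the Unix timestamp is stored under the "deletion_time" key.
    return now_unix - mark["deletion_time"] >= delete_delay_seconds
```

With a delay of 0 seconds a marked block is deletable immediately, which corresponds to keeping the previous behaviour.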
#2090 breaking Downsample command: the downsample command has moved and is now a sub-command of the thanos bucket sub-command; it cannot be called via thanos downsample any more.
#2294 Store: optimizations for fetching postings. Queries using =~".*" matchers or negation matchers (!=... or !~...) benefit the most.
#2301 Ruler: exit with an error when initialization fails.
#2310 Query: report timespan 0 to 0 when discovering no stores.
#2330 Store: index-header is no longer experimental. It is enabled by default for store Gateway. You can disable it with new hidden flag: --store.disable-index-header. The --experimental.enable-index-header flag was removed.
#1848 Ruler: allow returning error messages when a reload is triggered via HTTP.
#2270 All: Thanos components will now print stack traces when they error out.
#1952 Store Gateway: Implemented binary index header. This significantly reduces resource consumption (memory, CPU, net bandwidth) for startup and data loading processes as well as baseline memory. This means that adding more blocks into object storage without querying them will use almost no resources. However, querying large amounts of data will still result in high spikes of memory and CPU use as before, simply due to fetching large amounts of metrics data. Since we fixed the baseline, we are now focusing on query performance optimizations in separate initiatives. To enable the experimental index-header mode, run store with the hidden --experimental.enable-index-header flag.
#2009 Store Gateway: Minimum age of all blocks before they are being read. Set it to a safe value (e.g. 30m) if your object storage is eventually consistent. GCS and S3 are (roughly) strongly consistent.
#1939 Ruler: Add TLS and authentication support for query endpoints with the --query.config and --query.config-file CLI flags. See documentation for further information.
#1982 Ruler: Add support for Alertmanager v2 API endpoints.
#2030 Query: Add thanos_proxy_store_empty_stream_responses_total metric for number of empty responses from stores.
#2049 Tracing: Support sampling on Elastic APM with new sample_rate setting.
#2008 Querier, Receiver, Sidecar, Store: Add gRPC health check endpoints.
#2145 Tracing: track query sent to prometheus via remote read api.
#1970 breaking Receive: Use gRPC for forwarding requests between peers. Note that existing values for the --receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple host:port combination, e.g. 127.0.0.1:10901, rather than a full HTTP URL, e.g. http://127.0.0.1:10902/api/v1/receive.
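When updating many hashring entries, the conversion from the old HTTP URL form to the new host:port form can be scripted. The helper below is a hypothetical sketch; the gRPC port has to be supplied by the caller because it cannot be derived from the HTTP URL.

```python
from urllib.parse import urlsplit


def to_grpc_endpoint(http_url, grpc_port):
    """Convert an old HTTP receive URL into a plain host:port gRPC endpoint."""
    host = urlsplit(http_url).hostname
    if host is None:
        raise ValueError(f"not a full URL: {http_url!r}")
    if ":" in host:
        # IPv6 literals need brackets in host:port notation.
        host = f"[{host}]"
    return f"{host}:{grpc_port}"
```

Applied to the example from the entry above, `http://127.0.0.1:10902/api/v1/receive` with gRPC port 10901 becomes `127.0.0.1:10901`.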
#1933 Add a flag --tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
#2021 Rename metric thanos_query_duplicated_store_address to thanos_query_duplicated_store_addresses_total and thanos_rule_duplicated_query_address to thanos_rule_duplicated_query_addresses_total.
#2166 Bucket Web: improve the tooltip for the bucket UI; it was reconstructed and now exposes much more information about blocks.
#2015 Sidecar: Querier /api/v1/series bug fixed when time range was ignored inside sidecar. The bug was noticeable for example when using Grafana template variables.
#2120 Bucket Web: Set state of status prober properly.
#1919 Compactor: Fixed potential data loss when uploading older blocks, or upload taking long time while compactor is running.
#1937 Compactor: Improved synchronization of meta JSON files. Compactor now properly handles partial block uploads for all operations like retention apply, downsampling and compaction. Additionally:
Removed thanos_compact_sync_meta_* metrics. Use thanos_blocks_meta_* metrics instead.
Added thanos_consistency_delay_seconds and thanos_compactor_aborted_partial_uploads_deletion_attempts_total metrics.
#1936 Store: Improved synchronization of meta JSON files. Store now properly handles corrupted disk cache. Added meta.json sync metrics.
#1856 Receive: close DBReadOnly after flushing to fix a memory leak.
#1882 Receive: upload to object storage as ‘receive’ rather than ‘sidecar’.
#1907 Store: Fixed the duration unit for the metric thanos_bucket_store_series_gate_duration_seconds.
#1931 Compact: Fixed the compactor successfully exiting when actually an error occurred while compacting a blocks group.
#1872 Ruler: /api/v1/rules now shows a properly formatted value
#1945 master container images are now built with Go 1.13
#1956 Ruler: now properly ignores duplicated query addresses
#1975 Store Gateway: fixed panic caused by memcached servers selector when there’s 1 memcached node
#1852 Add support for AWS_CONTAINER_CREDENTIALS_FULL_URI by upgrading to minio-go v6.0.44
#1854 Update Rule UI to support alerts count displaying and filtering.
#1838 Ruler: Add TLS and authentication support for Alertmanager with the --alertmanagers.config and --alertmanagers.config-file CLI flags. See documentation for further information.
#1838 Ruler: Add a new --alertmanagers.sd-dns-interval CLI option to specify the interval between DNS resolutions of Alertmanager hosts.
#1881 Store Gateway: memcached support for index cache. See documentation for further information.
#1904 Add a skip-chunks option in Store Series API to improve the response time of /api/v1/series endpoint.
#1910 Query: /api/v1/labels now understands POST - useful for sending bigger requests
#1656 Store Gateway: Store now starts metric and status probe HTTP server earlier in its start-up sequence. /-/healthy endpoint now starts to respond with success earlier. /metrics endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the /-/ready endpoint rather than /metrics.
#1669 Store Gateway: Fixed store sharding. Now it does not load excluded meta.jsons and load/fetch index-cache.json files.
#1670 Sidecar: Fixed un-ordered blocks upload. Sidecar now uploads the oldest blocks first.
#1568 Store Gateway: Store now retains the first raw value of a chunk during downsampling to avoid losing some counter resets that occur on an aggregation boundary.
#1666 Compact: thanos_compact_group_compactions_total now counts block compactions, so operations that resulted in a compacted block. The old behaviour is now exposed by new metric: thanos_compact_group_compaction_runs_started_total and thanos_compact_group_compaction_runs_completed_total which counts compaction runs overall.
#1694 prober_ready and prober_healthy metrics are removed in favor of status. Now status exposes the same metric with a label, check, whose value is “healthy” or “ready” depending on the status of the probe.
#1632 Removes the duplicated external labels detection on Thanos Querier; warning only; Made Store Gateway compatible with older Querier versions.
NOTE: thanos_store_nodes_grpc_connections metric is now per external_labels and store_type. It is a recommended metric for Querier storeAPIs. thanos_store_node_info is marked as obsolete and will be removed in next release.
NOTE2: Store Gateway is now advertising artificial: "@thanos_compatibility_store_type=store" label. This is to have the current Store Gateway compatible with Querier pre v0.8.0. This label can be disabled by hidden debug.advertise-compatibility-label=false flag on Store Gateway.
Add relabel config (--selector.relabel-config-file and --selector.relabel-config) into Thanos Store and Compact components. Selecting blocks to serve depends on the result of block labels relabeling.
For store gateway, advertise labels from “approved” blocks.
#1540 Thanos Downsample added /-/ready and /-/healthy endpoints.
#1538 Thanos Rule added /-/ready and /-/healthy endpoints.
#1537 Thanos Receive added /-/ready and /-/healthy endpoints.
#1460 Thanos Store Added /-/ready and /-/healthy endpoints.
#1534 Thanos Query Added /-/ready and /-/healthy endpoints.
#1533 Thanos inspect now supports the timeout flag.
#1496 Thanos Receive now supports setting block duration.
#1362 Optional replicaLabels param for /query and /query_range querier endpoints. When provided overwrite the query.replica-label cli flags.
#1482 Thanos now supports Elastic APM as tracing provider.
#1362 query.replica-label configuration can be provided more than once for multiple deduplication labels like: --query.replica-label=prometheus_replica --query.replica-label=service.
#1581 Thanos Store now can use smaller buffer sizes for Bytes pool; reducing memory for some requests.
#1478 Thanos components now expose gRPC server metrics as soon as the server starts, to provide more reliable data for instrumentation.
#1378 Thanos Receive now exposes thanos_receive_config_hash, thanos_receive_config_last_reload_successful and thanos_receive_config_last_reload_success_timestamp_seconds metrics to track latest configuration change
#1268 Thanos Sidecar added support for newest Prometheus streaming remote read added here. This massively improves memory required by single request for both Prometheus and sidecar. Single requests now should take constant amount of memory on sidecar, so resource consumption prediction is now straightforward. This will be used if you have Prometheus 2.13 or 2.12-master.
#1358 Added part_size configuration option for HTTP multipart requests minimum part size for S3 storage type
#1363 Thanos Receive now exposes thanos_receive_hashring_nodes and thanos_receive_hashring_tenants metrics to monitor status of hash-rings
#1395 Thanos Sidecar added /-/ready and /-/healthy endpoints to Thanos sidecar.
#1297 Thanos Compact added /-/ready and /-/healthy endpoints to Thanos compact.
#1431 Thanos Query added hidden flag to allow the use of downsampled resolution data for instant queries.
#1408 Thanos Store Gateway can now allow the specifying of supported time ranges it will serve (time sharding). Flags: min-time & max-time
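Time sharding can be pictured as a filter over block time ranges. The sketch below assumes that a block is served when its time range overlaps the configured window, with inclusive bounds; the exact selection rule is not stated in the entry above, and the function name is hypothetical.

```python
def block_in_time_shard(block_min, block_max, min_time, max_time):
    """Return True if the block range [block_min, block_max] overlaps the window."""
    # Assumption: overlap with inclusive bounds decides whether a block is served.
    return block_max >= min_time and block_min <= max_time
```

All arguments are timestamps in the same unit, for example milliseconds since the Unix epoch.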
#1414 Upgraded important dependencies: Prometheus to 2.12-rc.0. TSDB is now part of Prometheus.
#1380 Upgraded important dependencies: Prometheus to 2.11.1 and TSDB to 0.9.1. Some changes affecting Querier:
[ENHANCEMENT] Query performance improvement: Efficient iteration and search in HashForLabels and HashWithoutLabels. #5707
[ENHANCEMENT] Optimize queries using regexp for set lookups. tsdb#602
[BUGFIX] prometheus_tsdb_compactions_failed_total is now incremented on any compaction failure. tsdb#613
[BUGFIX] PromQL: Correctly display {__name__="a"}.
#1338 Thanos Query still warns on store API duplicates, but allows a single one from the duplicated set. This is to gracefully warn about the problematic logic without disrupting immediately.
#1385 Thanos Compact exposes a flag to disable downsampling: downsampling.disable.
#1458 Thanos Query and Receive now use common instrumentation middleware. As a result, in favor of http_requests_total and http_request_duration_seconds_bucket, Thanos Query no longer exposes the thanos_query_api_instant_query_duration_seconds and thanos_query_api_range_query_duration_second metrics, and Thanos Receive no longer exposes thanos_http_request_duration_seconds, thanos_http_requests_total and thanos_http_response_size_bytes.
#1253 Add support for specifying a maximum amount of retries when using Azure Blob storage (default: no retries).
#1244 Thanos Compact now exposes new metrics thanos_compact_downsample_total and thanos_compact_downsample_failures_total which are useful to catch when errors happen
#1260 Thanos Query/Rule now exposes metrics thanos_querier_store_apis_dns_provider_results and thanos_ruler_query_apis_dns_provider_results which tell how many addresses were configured and how many were actually discovered respectively
#1248 Add a web UI to show the state of remote storage.
#1217 Thanos Receive gained basic hashring support
#1262 Thanos Receive got a new metric thanos_http_requests_total which shows how many requests were handled by it
#1243 Thanos Receive gained the ability to forward time series data between nodes. Now you can pass the hashring configuration via --receive.hashrings-file; the refresh interval via --receive.hashrings-file-refresh-interval; the local node’s name via --receive.local-endpoint; and finally the name of the header used to determine the tenant via --receive.tenant-header.
#1147 Support for the Jaeger tracer has been added!
breaking New common flags were added for configuring tracing: --tracing.config-file and --tracing.config. You can either pass a file to Thanos with the tracing configuration or pass it in the command line itself. Old --gcloudtrace.* flags were removed ⚠️
To migrate over the old --gcloudtrace.* configuration, your tracing configuration should look like this:
#1284 Add support for multiple label-sets in Info gRPC service. This deprecates the single Labels slice of the InfoResponse; in a future release, backward compatible handling for the single set of Labels will be removed. Upgrading to v0.6.0 or higher is advised. breaking If you have duplicate queries in your Querier configuration with hierarchical federation of multiple Queries, this PR makes Thanos Querier detect this case and block all duplicates. Refer to 0.6.1, which at least allows a single replica to work.
#1314 Removes http_request_duration_microseconds (Summary) and adds http_request_duration_seconds (Histogram) from http server instrumentation used in Thanos APIs and UIs.
#1287 Sidecar now waits on Prometheus’ external labels before starting the uploading process
#1261 Thanos Receive now exposes metrics thanos_http_request_duration_seconds and thanos_http_response_size_bytes properly of each handler
#1274 Iteration limit has been lifted from the LRU cache so there should be no more spam of error messages as they were harmless
#1321 Thanos Query now fails early on a query which only uses external labels - this improves clarity in certain situations
#1227 Some context handling issues were fixed in Thanos Compact; some unnecessary memory allocations were removed in the hot path of Thanos Store.
#1183 Compactor now correctly propagates retriable/haltable errors which means that it will not unnecessarily restart if such an error occurs
#1231 Receive now correctly handles SIGINT and closes without deadlocking
#1278 Fixed inflated values problem with sum() on Thanos Query
#1280 Fixed a problem with concurrent writes to a map in Thanos Query while rendering the UI
#1311 Fixed occasional panics in Compact and Store when using Azure Blob cloud storage caused by lack of error checking in client library.
#1322 Removed duplicated closing of the gRPC listener - this gets rid of harmless messages like store gRPC listener: close tcp 0.0.0.0:10901: use of closed network connection when those programs are being closed
TL;DR: Store LRU cache is no longer leaking, Upgraded Thanos UI to Prometheus 2.9, Fixed auto-downsampling, Moved to Go 1.12.5 and more.
This version moved tarballs to Golang 1.12.5 from 1.11 as well, so same warning applies if you use container_memory_usage_bytes from cadvisor. Use container_memory_working_set_bytes instead.
breaking As announced a couple of times, this release also removes gossip with all configuration flags (--cluster.*).
#1118 breaking swift: Added support for cross-domain authentication by introducing userDomainID, userDomainName, projectDomainID, projectDomainName. The outdated terms tenantID, tenantName are deprecated and have been replaced by projectID, projectName.
⚠️ IMPORTANT ⚠️ This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.
This release also disables gossip mode by default for all components. See this for more details.
⚠️ This release moves Thanos docker images (NOT artifacts by accident) to Golang 1.12. This release includes change in GC’s memory release which gives following effect:
On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.
If you want to see exact memory allocation of Thanos process:
Use go_memstats_heap_alloc_bytes metric exposed by Golang or container_memory_working_set_bytes exposed by cadvisor.
Add GODEBUG=madvdontneed=1 before running the Thanos binary to revert memory releasing to the pre-1.12 logic.
#910 Query’s stores UI page is now sorted by type and old DNS or File SD stores are removed after 5 minutes (configurable via the new --store.unhealthy-timeout=5m flag).
#905 Thanos support for Query API: /api/v1/labels. Notice that the API was added in Prometheus v2.6.
#798 Ability to limit the maximum number of concurrent requests to Series() calls in Thanos Store and the maximum amount of samples we handle.
#1060 Allow specifying region attribute in S3 storage configuration
⚠️ WARNING ⚠️ #798 adds a new default limit to Thanos Store: --store.grpc.series-max-concurrency. Most likely you will want to make it the same as --query.max-concurrent on Thanos Query.
New options:
New Store flags:
* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider making it lower or higher depending on the scale of your deployment.
New Store metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.
New Store tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.
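The gate behind `--store.grpc.series-max-concurrency` can be pictured as a counting semaphore around each Series() call. The sketch below illustrates that technique only; the `Gate` class is hypothetical and does not reproduce the Store implementation or its metrics.

```python
import threading


class Gate:
    """Allow at most max_concurrent callers inside the guarded section."""

    def __init__(self, max_concurrent):
        self._semaphore = threading.BoundedSemaphore(max_concurrent)

    def __enter__(self):
        # Blocks while max_concurrent callers are already inside the gate.
        self._semaphore.acquire()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._semaphore.release()
        return False  # do not swallow exceptions raised by the guarded call
```

A handler would wrap its work in `with gate:` so that additional calls wait for their turn; the time spent waiting is what a span like `store_query_gate_ismyturn` would capture.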
New Querier and Ruler flag: --store.sd-dns-resolver, which allows specifying the resolver to use: either golang or miekgdns.
#986 Save some startup and sync time in the store gateway, as it no longer needs to compute the index cache from the block index on its own for larger blocks. The store gateway can still do it, but it first checks the bucket for an already uploaded index cache file. At the same time, the compactor precomputes the index cache file on every compaction.
New Compactor flag: --index.generate-missing-cache-file was added to allow quicker addition of index cache files. If enabled it precomputes missing files on compactor startup. Note that it will take time and it’s only one-off step per bucket.
#887 Compact: Added new --block-sync-concurrency flag, which allows you to configure number of goroutines to use when syncing block metadata from object storage.
#928 Query: Added --store.response-timeout flag. If a Store doesn’t send any data in this specified duration then a Store will be ignored and partial data will be returned if it’s enabled. 0 disables timeout.
#893 S3 storage backend has graduated to stable maturity level.
#936 Azure storage backend has graduated to stable maturity level.
#937 S3: added trace functionality. You can add trace.enable: true to enable the minio client’s verbose logging.
#953 Compact: now has a hidden flag --debug.accept-malformed-index. Compaction index verification will ignore out of order label names.
#963 GCS: added possibility to inline ServiceAccount into GCS config.
#1010 Compact: added new flag --compact.concurrency. Number of goroutines to use when compacting groups.
#1028 Query: added --query.default-evaluation-interval, which sets default evaluation interval for sub queries.
#980 Ability to override Azure storage endpoint for other regions (China)
#970 Deprecated partial_response_disabled proto field. Added partial_response_strategy instead. Both in gRPC and Query API. No PartialResponseStrategy field for RuleGroups by default means abort strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.
Metrics:
Added thanos_rule_evaluation_with_warnings_total to Ruler.
DNS thanos_ruler_query_apis* are now thanos_ruler_query_apis_* for consistency.
DNS thanos_querier_store_apis* are now thanos_querier_store_apis__* for consistency.
Query Gate thanos_bucket_store_series* are now thanos_bucket_store_series_* for consistency.
Most Thanos Ruler metrics related to the rule manager now have a strategy label.
Ruler tracing spans:
/rule_instant_query HTTP[client] is now /rule_instant_query_part_resp_abort HTTP[client] if the request uses the abort strategy.
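The defaulting rule from #970 can be summarised in a short sketch. The names below are hypothetical and do not mirror the generated protobuf types. The entry above only names the abort strategy; the sketch assumes a second strategy, called warn here, for the case where partial responses are allowed.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    WARN = "warn"    # assumed name for "partial response allowed"
    ABORT = "abort"  # old "partial response disabled" behaviour


def rule_group_strategy(configured: Optional[str]) -> Strategy:
    """A rule group with no strategy set falls back to abort."""
    if configured is None:
        return Strategy.ABORT
    return Strategy(configured.lower())


def from_deprecated_flag(partial_response_disabled: bool) -> Strategy:
    """Map the deprecated boolean field onto the new strategy field."""
    return Strategy.ABORT if partial_response_disabled else Strategy.WARN
```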
#1009: Upgraded Prometheus (~v2.7.0-rc.0 to v2.8.1) and TSDB (v0.4.0 to v0.6.1) deps.
Changes that affect Thanos:
query:
[ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
[ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
[BUGFIX] Fix panic when aggregator param is not a literal. #5290
ruler:
[ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
[BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
[BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our issue #1027
[BUGFIX] Fix sorting of rule groups. #5260
store: [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
tooling: [FEATURE] New dump command to tsdb tool to dump all samples.
compactor:
[ENHANCEMENT] When closing the db any running compaction will be cancelled so it doesn’t block.
[CHANGE] breaking Renamed flag --sync-delay to --consistency-delay #1053
Note that this was added on TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370 However, due to the nature of Thanos compaction (distributed systems), for safety reasons this is disabled for the Thanos compactor for now.
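As an illustration of the histogram_quantile enhancement listed under the query changes for #1009: buckets whose le label values are written differently but denote the same number (for example "1" and "1.0") are treated as one bucket. The sketch below shows that idea in isolation; the function name is hypothetical and it is not the Prometheus implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_equivalent_buckets(
    buckets: List[Tuple[str, float]],
) -> List[Tuple[float, float]]:
    """Sum counts of buckets whose `le` strings parse to the same float."""
    merged: Dict[float, float] = defaultdict(float)
    for le, count in buckets:
        merged[float(le)] += count  # float("+Inf") parses to infinity
    # Return buckets ordered by upper bound, as a quantile estimate needs.
    return sorted(merged.items())
```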
#1055 Gossip flags are now disabled by default and deprecated.
#964 repair: Repair process now sorts the series and labels within block.
#1073 Store: index cache for requests. It now calculates the size properly (including the slice header), has an anti-deadlock safeguard and reports more metrics.
⚠️ WARNING ⚠️ #873 fixes the actual handling of index-cache-size. Handling of the limit for this cache was broken, so it was unbounded all the time. From this release the actual value matters and is extremely low by default. To “revert” to the old behaviour (no boundary), use a large enough value.
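To make the effect of a size limit concrete, here is a minimal sketch of a byte-bounded LRU cache that counts a slice header for each item, in the spirit of #1073 and #873. It is not the store gateway's cache: the class and method names are hypothetical, and the 24-byte constant is the size of a Go slice header on 64-bit platforms (pointer, length, capacity).

```python
from collections import OrderedDict
from typing import Optional

SLICE_HEADER_BYTES = 24  # Go slice header on 64-bit: pointer + len + cap


class BoundedIndexCache:
    """LRU cache limited by total bytes, counting a per-item slice header."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.cur_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    @staticmethod
    def item_size(value: bytes) -> int:
        return len(value) + SLICE_HEADER_BYTES

    def set(self, key: str, value: bytes) -> bool:
        size = self.item_size(value)
        if size > self.max_bytes:
            return False  # can never fit, so refuse instead of evicting
        if key in self._items:
            self.cur_bytes -= self.item_size(self._items.pop(key))
        while self.cur_bytes + size > self.max_bytes:
            _, evicted = self._items.popitem(last=False)  # least recently used
            self.cur_bytes -= self.item_size(evicted)
        self._items[key] = value
        self.cur_bytes += size
        return True

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]
```

With a very low limit, most items are evicted almost immediately or refused outright, which is why the default may need raising after this fix.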
#529 Massive improvement for compactor. Downsampling memory consumption was reduced to only store labels and a single chunk per series.
Querier UI: Store page now shows the store APIs per component type.
Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lots of things have changed. See details here #704 Known changes that affect us:
prometheus/prometheus/discovery/file
[ENHANCEMENT] Discovery: Improve performance of previously slow updates of changes of targets. #4526
[BUGFIX] Wait for service discovery to stop before exiting #4508 ??
prometheus/prometheus/promql:
[ENHANCEMENT] Subqueries support. #4831
[BUGFIX] PromQL: Fix a goroutine leak in the lexer/parser. #4858
[BUGFIX] Change max/min over_time to handle NaNs properly. #438
[BUGFIX] Check label name for count_values PromQL function. #4585
[BUGFIX] Ensure that vectors and matrices do not contain identical label-sets. #4589
[ENHANCEMENT] Optimize PromQL aggregations #4248
[BUGFIX] Only add LookbackDelta to vector selectors #4399
[BUGFIX] Reduce floating point errors in stddev and related functions #4533
prometheus/prometheus/rules:
New metrics exposed! (prometheus evaluation!)
[ENHANCEMENT] Rules: Error out at load time for invalid templates, rather than at evaluation time. #4537
prometheus/tsdb/index: Index reader optimizations.
Thanos store gateway flag for sync concurrency (block-sync-concurrency with 20 default, so no change by default)
Tests against Prometheus below v2.2.1. This does not mean lack of support for those versions, only that we don’t test the compatibility anymore. See #758 for details.