This proposal describes how Thanos Query currently handles the health of store nodes and why that handling should change (e.g. to improve response caching and avoid surprises).
It explores the available options, weighs their pros and cons, and finally proposes the variant that we believe is the best of them.
Currently Thanos Query updates the list of healthy store nodes every 5 seconds. It does that by sending the `Info()` call via gRPC to each node. The time of the last successful check is noted in the `LastCheck` field. If the call fails, we note the error and remove the node from the set of active store nodes. If it succeeds, the node becomes (or remains) part of the active store set.
Once `--store.unhealthy-timeout` has passed since `LastCheck`, the node is also removed from the UI. At that point we forget about it completely.
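The bookkeeping described above can be modelled in a few lines. The following is only a sketch in Python (Thanos itself is written in Go), not the actual implementation: the class and field names other than `LastCheck` are hypothetical, times are plain seconds, the 300-second timeout and the address are arbitrary example values, and a node that has never answered successfully is simply not tracked.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class StoreStatus:
    """Hypothetical per-node record; only LastCheck is named in the proposal."""
    last_check: float                 # time of the last successful Info() call
    last_error: Optional[str] = None  # error noted by the most recent failed check


@dataclass
class StoreSet:
    unhealthy_timeout: float  # models --store.unhealthy-timeout, in seconds
    active: Set[str] = field(default_factory=set)  # nodes that receive queries
    statuses: Dict[str, StoreStatus] = field(default_factory=dict)  # nodes shown in the UI

    def update(self, now: float, results: Dict[str, Optional[str]]) -> None:
        """Apply one health-check round; results maps address -> error (None on success)."""
        for addr, err in results.items():
            if err is None:
                self.statuses[addr] = StoreStatus(last_check=now)
                self.active.add(addr)
            else:
                # A failed check removes the node from the active set immediately.
                self.active.discard(addr)
                if addr in self.statuses:
                    self.statuses[addr].last_error = err
        # Forget inactive nodes once the timeout has passed since their last success.
        for addr in list(self.statuses):
            expired = now - self.statuses[addr].last_check >= self.unhealthy_timeout
            if addr not in self.active and expired:
                del self.statuses[addr]


stores = StoreSet(unhealthy_timeout=300.0)
stores.update(now=0.0, results={"store-a:10901": None})
stores.update(now=5.0, results={"store-a:10901": "connection refused"})
# store-a no longer receives queries, but is still listed together with its error.
stores.update(now=305.0, results={"store-a:10901": "connection refused"})
# store-a has now been forgotten completely.
```

Note how a single failed check is enough to take the node out of `active`, while `statuses` only controls how long it stays visible.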
Every time a query is executed, we consult the active store set and send the query to all of the nodes that match according to the external labels and the min/max times that they advertise. However, if a node goes down according to our previous definition, then within 5 seconds we stop sending anything to it. This means that we cannot even control this via the partial response options, since no query is sent to those nodes in the first place.
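To make the selection step concrete, here is a minimal Python sketch of how a query could be matched against the active store set. It is an illustration under assumptions, not Thanos code: the `Store` type and `select_stores` function are hypothetical, only equality matchers are modelled, and a matcher on a label that a store does not advertise is assumed not to exclude that store.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Store:
    addr: str
    labels: Dict[str, str]  # external labels advertised by the node
    min_time: int           # oldest timestamp served, advertised by the node
    max_time: int           # newest timestamp served, advertised by the node


def select_stores(active: List[Store], matchers: Dict[str, str],
                  start: int, end: int) -> List[Store]:
    """Return the active stores that could hold data for the query."""
    selected = []
    for store in active:
        # The query range must overlap the time range the store advertises.
        overlaps = store.min_time <= end and start <= store.max_time
        # A matcher excludes a store only if the store advertises that label
        # with a different value (assumption: equality matchers only).
        labels_ok = all(store.labels.get(name, value) == value
                        for name, value in matchers.items())
        if overlaps and labels_ok:
            selected.append(store)
    return selected


active = [
    Store("sidecar-eu:10901", {"region": "eu"}, 1000, 2000),
    Store("sidecar-us:10901", {"region": "us"}, 1000, 2000),
]
hits = select_stores(active, {"region": "eu"}, 1500, 1800)
print([s.addr for s in hits])  # ['sidecar-eu:10901']
```

Because the function only ever sees the active set, a node that failed its last health check is never a candidate, whatever the partial response setting is.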
This is problematic in the case of end-response caching. If we know that certain StoreAPI nodes should always be up, then whenever one of them is down we want the response to carry not just an error (if partial response is disabled) but also the appropriate `Cache-Control` header. Right now we would get that for at most 5 seconds after a store node goes down.
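The effect on caching can be illustrated with a small Python sketch. The proposal does not say which `Cache-Control` value is appropriate, so the use of the standard `no-store` directive here is an assumption, as are the hypothetical `cache_headers` helper and the store names.

```python
from typing import Dict, Iterable


def cache_headers(expected: Iterable[str], responded: Iterable[str]) -> Dict[str, str]:
    """Mark a response as uncacheable when an expected store did not answer."""
    missing = set(expected) - set(responded)
    if missing:
        # "no-store" is a standard HTTP directive; choosing it here is an assumption.
        return {"Cache-Control": "no-store"}
    return {}


# While store-b is still in the active set, its failure is visible:
print(cache_headers(["store-a", "store-b"], ["store-a"]))  # {'Cache-Control': 'no-store'}
# Once store-b has been dropped from the active set, nothing looks missing:
print(cache_headers(["store-a"], ["store-a"]))  # {}
```

The second call is the problematic case: the response is incomplete, yet nothing tells the caching layer so.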
Thus, this logic needs to be changed somehow. There are a few possible options:
1. `--store.unhealthy-timeout` could be made to apply to this case as well: we could still consider a node part of the active store set while it is still visible in the UI.
2. A new option such as `--store.hold-timeout` could be introduced, which would be `--store.unhealthy-timeout`'s brother, and we would hold the StoreAPI nodes for `max(hold_timeout, unhealthy_timeout)`.
3. `--store.strict-mode` could be introduced, which means that we would always retain the information about the StoreAPI nodes from their last successful check.
4. `--store` could be extended to include another flag which would let users specify the previous option per node.

Let's look through their pros and cons:
... `--store.unhealthy-timeout` as well while setting it.

If we were to graph these choices in terms of their incisiveness and complexity, it would look something like this:
Most incisive / Least Complex ------------ Least incisive / Most Complex
#1 #2 #4
#3
After careful consideration and with the rationale in this proposal, we have decided to go with the third option. It should provide a sweet spot between being too invasive and giving our users the ability to fall back to the old behavior.
Handling newly added store nodes deserves a separate discussion and/or proposal. The issue when adding a completely new store node via service discovery is that it may suddenly provide new information about the past. In this paragraph, “new” means new in terms of the data that the node provides. Generally, over time only a limited number of Prometheus instances will be providing data that can only change in the future (relative to the current time).
When this happens, we will most likely need to signal the caching layer that it has to drop some or all of its cached results, depending on:
The way this is done should be as generic as possible, so its design and solution remain an open question that this proposal does not solve.
The proposed `--store.strict-mode` flag will make Thanos Query always retain the information last successfully retrieved via the `Info()` gRPC method from statically defined nodes, and thus always consider them part of the active store set.
`--store.strict-mode` will be implemented in Thanos Query, which will make it keep statically defined nodes around. It will be disabled by default to reduce surprises when upgrading.
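The intended strict-mode behavior could look roughly like the Python sketch below. This is a model under stated assumptions rather than the planned implementation: `StrictStoreSet`, `static_addrs` and the dictionary payloads are hypothetical, and a static node that has never answered `Info()` successfully has nothing to retain, so it stays out of the active set here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class StrictStoreSet:
    static_addrs: Set[str]     # hypothetical: addresses of statically defined nodes
    strict_mode: bool = False  # models --store.strict-mode; disabled by default
    active: Dict[str, dict] = field(default_factory=dict)  # address -> last Info() data

    def update(self, results: Dict[str, Optional[dict]]) -> None:
        """results maps address -> Info() payload, or None if the call failed."""
        for addr, info in results.items():
            if info is not None:
                self.active[addr] = info
            elif self.strict_mode and addr in self.static_addrs:
                # Keep whatever was last retrieved successfully for this node.
                continue
            else:
                # Old behavior: a failed check drops the node from the active set.
                self.active.pop(addr, None)


strict = StrictStoreSet(static_addrs={"store-a:10901"}, strict_mode=True)
strict.update({"store-a:10901": {"min_time": 0, "max_time": 1000}})
strict.update({"store-a:10901": None})
print(sorted(strict.active))  # ['store-a:10901']
```

With `strict_mode` left at its default of `False`, the same two calls leave `active` empty, which matches the current behavior.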