Written by Anugrah Vijay and Vic Thomas and published on Nov 09, 2022
Anugrah and Vic are Medallia employees. However, the opinions expressed here are their own, and not necessarily endorsed by Medallia or their co-workers. References to “we” in this article pertain either to Anugrah and Vic specifically or to generalized experiences of the observability engineering team at Medallia.
At Medallia, we successfully operate a hybrid Thanos architecture that allows our users to experience high-performance queries of software and infrastructure metrics generated across 10 large colocation data centers and 30+ small virtual data centers via a single pane of glass.
By “hybrid”, we mean that we have a single, global Thanos Query deployment. This global Thanos Query makes federated queries to the large colocation data centers via chained Thanos Query components. It also talks to a Thanos Receive ring that receives samples via remote write from the smaller virtual data centers, which are public cloud (AWS, OCI, etc.) Kubernetes clusters.
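To make the topology concrete, here is a purely conceptual sketch in Python. This is not Thanos code, and every endpoint address is hypothetical; it simply shows that the single global Thanos Query treats the per-data-center queriers and the Receive ring as one flat set of store endpoints to fan out to.

```python
# Conceptual sketch of the hybrid query path. This is NOT Thanos code, and
# every endpoint address below is hypothetical.

# Large colocation data centers: each runs a data-center-scoped Thanos Query
# that the global Thanos Query reaches over the gRPC Store API.
colo_queriers = {
    "dc1": "thanos-query.dc1.example:10901",
    "dc2": "thanos-query.dc2.example:10901",
    # ... one entry per colocation data center
}

# Small virtual data centers (AWS, OCI, ...) remote-write into a central
# Thanos Receive ring, which the global querier treats as one more endpoint.
receive_ring = {"receive-ring": "thanos-receive.primary.example:10901"}

def fan_out_targets() -> dict:
    """Every endpoint the single global Thanos Query fans a query out to."""
    return {**colo_queriers, **receive_ring}

if __name__ == "__main__":
    for name, addr in fan_out_targets().items():
        print(f"global query -> {name} ({addr})")
```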
The number of active series that our metrics platform ingests at any given time varies between 900 million and 1.1 billion, depending on the workloads that are running across our global compute infrastructure. We define active series as the sum of the series in the Prometheus head blocks.
With a scrape frequency of once per minute, that puts our ingestion throughput between 13 and 17 million samples per second across all our ingest components (Prometheus/Thanos Receive).
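For the curious, the relationship between active series, scrape interval, and throughput is simple arithmetic. The back-of-the-envelope sketch below is illustrative only; observed throughput sits slightly below the naive estimate, likely because the head blocks also count recently churned series that are no longer receiving samples every minute.

```python
# Back-of-the-envelope only: the figures below are the approximate ranges
# quoted above, not exact measurements.

SCRAPE_INTERVAL_SECONDS = 60  # once-per-minute scrape frequency

def naive_samples_per_second(active_series: float) -> float:
    """Ingest rate if every active series produced one sample per scrape interval."""
    return active_series / SCRAPE_INTERVAL_SECONDS

for series in (900e6, 1.1e9):
    rate = naive_samples_per_second(series) / 1e6
    print(f"{series / 1e6:,.0f}M active series -> ~{rate:.1f}M samples/s")

# Prints roughly 15.0M and 18.3M samples/s; observed throughput (13-17M) is a
# bit lower, since not every head-block series receives a sample every minute.
```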
Founded in 2001, Medallia is a pioneer and leader in the field of experience management as a SaaS solution. The initial focus for the company was on collecting and analyzing solicited feedback (i.e., web-based surveys). But, the focus has evolved over the years to include a variety of interactions, including voice, video, digital, IoT, social media, and corporate-messaging tools.
Observability is largely a practice of capturing a variety of health and performance signals from software and hardware, ingesting those signals into appropriate data storage systems, and providing tools for analysis and alerting on that data. Analogously, Medallia captures feedback signals from its customers’ customers and employees, ingests those signals into its data storage systems, and provides analysis and alerting tools for use by its customers.
Just as a goal in observability is to catch problems early and to facilitate fast mitigation, Medallia helps its customers to detect customer and employee dissatisfaction and to take proactive action while the relationship can still be salvaged.
With such a strong analogy from our own business to lean on, the value of investing in observability has not been a difficult concept to justify. However, until the middle of the previous decade, observability within Medallia was limited to reachability monitoring and log aggregation. At that point, however, the company’s broad-focused SRE organization brought forth the use of metrics, which was both a revelation to and a revolution within the engineering organization with respect to awareness of the quality and performance of our software.
However, there have been challenges along the way. Our metrics journey is a story with many chapters.
Circa 2015, the SRE organization initiated a contract with a vendor for metrics dashboards and alerts. For the first time, in real-time, engineers could observe behavioral characteristics of applications and infrastructure. Engineers rejoiced!
However, the initiative’s wild success soon revealed that costs would become prohibitive. For a private company looking to go public, there can be pressures to scale costs logarithmically with respect to revenues. So, another solution was needed.
By 2018, a small task force within the SRE organization had deployed a proof of concept for an open source, self-hosted solution for metrics storage and alerting with the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). Leveraging Grafana for visualizations, the solution was immediately appealing. The query language, InfluxQL, was SQL-inspired, which promised to reduce the learning curve for engineers onboarding to the solution.
However, SQL-derived languages can be subpar for time series analysis, burdening the query writer with the task of aggregating both horizontally (i.e., temporally, into time buckets) and vertically (i.e., over various label dimensions). Further, the query language for this solution did not allow “join” queries across multiple metrics. And the alerting language (TICK Scripts) was a very different beast altogether. The final nail in the coffin was that the open-source version of the product had limited ability to scale. This left us with two very unpalatable choices – either manage a large number of deployments of mutually exclusive and collectively exhaustive metrics data sets or pay big money for the enterprise product.
By late 2019, it was clear to the SRE organization that another solution would be necessary very soon. Again, a small task force deployed a proof of concept, this time of Prometheus. It was immediately obvious that the query language, PromQL, was significantly more powerful and elegant. Moreover, we could use PromQL both for dashboards and for alert definitions. Additionally, the pull-based ingest pipeline provided benefits that could be the subject of an article of its own.
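As a small, hypothetical illustration of that dual use, the same PromQL expression that drives a dashboard panel can also serve as an alert condition. The sketch below simply evaluates such an expression against Prometheus's standard HTTP query API; the host, metric, and label names are made up.

```python
# Minimal sketch: evaluate a PromQL expression via the Prometheus HTTP API.
# The Prometheus host, metric, and label names here are hypothetical.
import requests

PROMETHEUS = "http://prometheus.example:9090"

# The same expression could back a Grafana panel or an alerting rule.
EXPR = 'sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": EXPR}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels, (_, value) = result["metric"], result["value"]
    print(f'{labels.get("job", "<none>")}: {float(value):.3f} 5xx requests/s')
```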
The only problem was that, as with the incumbent solution, routing all samples from all data centers into a single, massively vertically-scaled Prometheus was clearly not going to be sustainable. A failure in such a configuration would take down the entire metrics system. WAL replay would take forever. And how would we even route metrics from all data centers into that one Prometheus?
At this point, the task force discovered the Thanos project. Thanos components would solve several major problems:
In early 2020, the SRE task force handed off their proof of concept to the newly-formed Performance and Observability (POE) team for hardening, widespread launch, and ongoing support. Within a few months, the POE team made a few tweaks and tunings, deployed the solution to the entire engineering organization, and deprecated the previous solution.
One of the key principles of the architectural design was to construct the Thanos and Prometheus stack in an identical manner in each colocation data center. This would allow us to treat each data center as plug-and-play. Each data center would have a data center-scoped Thanos Query. In a primary data center, we would operate Grafana and a globally-scoped Thanos Query that in turn talks to each of the data center-scoped Thanos Query components.
This solution was elegant and relatively easy to support. The colocation data centers are sufficiently large to easily accommodate all the Thanos and Prometheus components.
However, by late 2020 Medallia had expanded its strategy to make more use of public cloud providers such as Amazon Web Services (AWS) and Oracle Cloud Infrastructure (OCI). So, the POE team faced a new mandate – incorporate metrics generated within these virtual data centers into the global view.
In general, these virtual data centers are small Kubernetes clusters, with compute nodes that are much smaller in terms of CPU and memory capacity than what we enjoy in the colocation data centers. Deploying the full complement of Thanos and Prometheus components in such an environment would require an unacceptably high percentage of capacity within those environments.
Further, we could anticipate that the number of virtual data centers would grow very fast. Fanning out distributed queries to 8 or 10 colocation data centers is one thing. But fanning out distributed queries to tens or even hundreds of additional virtual data centers gave us worries – not that Thanos Query couldn’t handle the job, but that the probability of one or more network glitches across so many remote destinations at any given point in time would not be acceptably small.
To solve these two issues, we decided to experiment with deploying Prometheus in the virtual data centers as essentially a collection agent. These Prometheus instances would remote-write their samples to a new Thanos Receive ring within our primary colocation data center. The global Thanos Query could then talk to Thanos Receive in addition to the data center-scoped Thanos Query components, thereby giving us the global view of metrics that we require.
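The following toy sketch illustrates the "ring" idea: incoming remote-written series are spread deterministically across the Receive replicas by hashing their labels. It is conceptual only (Thanos's real hashring and replication logic is more sophisticated), and the member names are hypothetical.

```python
# Toy illustration of the ring idea behind Thanos Receive. Conceptual only:
# Thanos's real hashring and replication logic is more sophisticated, and the
# member names below are hypothetical.
import hashlib

RECEIVERS = ["receive-0", "receive-1", "receive-2"]  # hypothetical ring members

def owner(series_labels: dict) -> str:
    """Pick the ring member that would ingest a given remote-written series."""
    key = ",".join(f"{k}={v}" for k, v in sorted(series_labels.items()))
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return RECEIVERS[digest % len(RECEIVERS)]

# Series arriving from different virtual data centers land on deterministic
# ring members, spreading ingest load across the ring.
print(owner({"__name__": "up", "cluster": "aws-us-east-1", "job": "node"}))
print(owner({"__name__": "up", "cluster": "oci-ashburn", "job": "node"}))
```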
For the past two years, this hybrid solution – both federated queries and remote-write – has met our needs very well. We are fully committed to Prometheus-based dashboards and alerting. Strategically, at this point, our ongoing focus will be to minimize total cost of ownership both in terms of labor and compute resources.
Note that the architecture diagram is representative, intended to show how the design can easily be extended to all the major public clouds.
As mentioned earlier, the number of active series at any given moment is in the ballpark of 1 billion, with our ingestion throughput at approximately 15 million samples per second due to our once-per-minute scrape frequency. We retain samples for 366 days – long enough to accommodate a full leap year.
Our current inventory, across all data centers, includes:
Over the course of a week, total resource utilization for all of our Thanos and Prometheus components across all data centers is within these ranges:
The following screenshots from our query-pipeline dashboards show how query performance varies in a typical week.
The hybrid architecture has given us the flexibility that we needed to scale our solution organically and to meet our needs as we encountered them. But, the solution requires understanding, maintaining, and integrating two different architectures, with all their nuances. From both a technical and a human perspective, we recognize that simplicity is a key to our ability to scale our observability solutions cost-effectively as the business continues to grow rapidly.
The fact that we must support small, virtual data centers means that the remote write architecture will not be going away. So, to reduce the overall complexity of our operations and to help new team members to be onboarded quickly, we are strongly considering a move toward a 100% remote write architecture with centralized storage.
Additional reasons that such an architectural change is appealing include:
Of course, with any solution there are tradeoffs. Moving toward centralized storage also comes with some disadvantages:
If we move in the centralized storage direction, Thanos Receive will certainly get consideration. But, for the sake of due diligence, we will also need to explore other solutions, such as Cortex and Mimir. However, whichever solution turns out to be the best fit for us, we do expect Thanos components to remain a major piece of our solution. Here is why:
The ability of Thanos components to play various roles within our architecture to support our ever-increasing scale of Prometheus-based metrics has been extremely valuable. It dramatically shortened the period of time from proof of concept to general availability. And, what has been amazing is that as our needs have shifted, Thanos has been able to facilitate those shifts without a radical reimplementation.
We certainly recommend Thanos to anyone who needs to operate Prometheus metrics at any scale while maintaining the flexibility in their architecture to accommodate changes – both anticipated and unanticipated. Thanks for reading!