Organizations naturally want to squeeze as much value out of their telemetry as they possibly can, for a number of reasons. Gathering telemetry data, or observability, is also a very tricky proposition, to say the least.
On one hand, turning on the spigot to pull every metric an environment generates quickly becomes, to put it mildly, an unwieldy and unmanageable situation, not to mention unaffordable for most, if not all, organizations.
On the other hand, sampling metrics data too little means the data is likely missing key elements for debugging, interpretation or monitoring for potential outages and other problems. Optimizing operations and development becomes skewed, inaccurate or unreliable. Additionally, using the wrong sampling data for metrics is of little to no help.
This dilemma is compounded for very large enterprises such as, in this case, Capital One.
During the Observability Day event ahead of KubeCon + CloudNativeCon North America, Capital One engineers Joseph Knight and Sateesh Mamidala showed how they relied on OpenTelemetry to solve the trace sampling problem and were able to implement the solution across Capital One’s entire operations worldwide.
Their efforts paid off: They reported a 70% reduction in tracing data volumes.
It wasn’t an easy task, but OpenTelemetry served as the backbone for their gargantuan project, which they detailed in their KubeCon presentation.
Capital One’s shift to @OpenTelemetry: Joseph Knight & Sateesh Mamidala discussed why it was necessary during their Observability Day talk “From Data Overload To Optimized Insights: Implementing OTel Sampling for Smarter Observability” ahead of #KubeCon NA. @linuxfoundation pic.twitter.com/qZMtmn4Jdx
— BC Gain (@bcamerongain), Nov. 11, 2025
As Knight said during their talk, Capital One’s metrics involved dealing with “more than a petabyte per day without any sampling.”
The solution required deploying dedicated infrastructure. Tail-based sampling means turning tracing into a horizontally scaling problem, as you must “bring all the spans together for a trace before you can make a sampling decision,” Knight said.
This, he added, resulted in layering collectors with a load-balancing exporter, a collector layer and then a sampling processor layer, all entirely dedicated to tracing.
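As an illustration of that first layer, here is a minimal sketch of a gateway collector that routes spans by trace ID to a downstream sampling tier using the load-balancing exporter from opentelemetry-collector-contrib. Capital One’s actual configuration was not published, so the endpoint, hostname and other values below are hypothetical placeholders.

```yaml
# Gateway layer sketch: fan spans out so that every span of a given trace
# lands on the same downstream collector, a prerequisite for tail sampling.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: "traceID"          # keep all spans of a trace on one backend
    protocol:
      otlp:
        tls:
          insecure: true            # illustrative only; use TLS in production
    resolver:
      dns:
        hostname: tail-sampling.observability.svc.cluster.local  # hypothetical service name
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

A sketch of the sampling layer that sits behind this gateway appears later in the article.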
Why Capital One Chose OpenTelemetry Over Vendor Tools
Before adopting OpenTelemetry, Capital One’s engineers relied on vendor tools that implemented their own, often disparate, sampling strategies, typically providing only head-based sampling, in which the decision to keep a trace or not is made at the beginning of a request.
OpenTelemetry “gave us the new perspective that head-based sampling is not very effective,” Knight said.
The current approach with OTel offers two key benefits, Knight said. The first is that the centralized team now has control over the costs of distributed tracing. That control ensures that broad adoption is possible with the available resources.
Second, the team can provide guarantees to application teams that “they will be able to see certain behavior in their tool,” such as specific errors, which builds “a lot more comfort in how sampling affects the traces coming from their application,” Knight said. This, he added, can’t be achieved with purely probabilistic or head-based sampling.
Best Practices for Making Sampled Tracing Data Useful
The key to making sampled data useful is the addition of tags. Capital One’s team adds tags to sampled traces to indicate how they were selected and at what probabilistic ratio they were sampled. This is useful in two ways, Knight said.
- Estimation: Teams can estimate the original trace volume by multiplying the trace count by the probabilistic ratio, which gives an estimate of how many traces or requests were generated prior to sampling (see the worked example after this list).
- Historical accuracy: By tagging the data directly, if the sampling ratios change over time, the original ratios are “baked in with the source data,” Knight said, allowing teams to look backward without seeing jumps over time.
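To make the estimation point concrete with a hypothetical illustration: if a trace is tagged with a 100:1 sampling ratio (that is, 1% of traces are kept), each sampled trace stands in for roughly 100 original requests, so 1,000 sampled traces imply about 100,000 requests before sampling. Because the ratio travels with each trace, that arithmetic still holds for older data even after the sampling policy changes.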
Furthermore, instead of relying on every span for rate data, teams should be taught to use metrics along with spans to get a more accurate picture of system behavior.
“We export the semantic convention metrics, histograms for every single span that we generate, both from the server and the client side,” Knight said.
Using these metrics for accurate counts means “you don’t need every span to understand the rate of your system,” he said. “Building rules and guides for translating tools, alerts and dashboards to use metrics can make this transition easier.”
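The talk didn’t spell out the exact mechanism Capital One uses to produce these span-derived metrics. As one illustrative possibility, the sketch below uses the spanmetrics connector from opentelemetry-collector-contrib to generate request counts and duration histograms from spans before they reach the sampling layer, so rate questions can be answered from metrics even when most spans are later dropped. The bucket boundaries, dimensions and endpoints are assumptions, not values from the talk.

```yaml
# Illustrative sketch: derive call counts and latency histograms from spans in
# the collector so dashboards and alerts don't depend on every span surviving
# tail sampling.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 25ms, 100ms, 500ms, 2s]    # assumed bucket boundaries
    dimensions:
      - name: http.response.status_code           # assumed extra dimension

exporters:
  loadbalancing:                  # same gateway exporter as in the earlier sketch
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: tail-sampling.observability.svc.cluster.local  # hypothetical
  prometheusremotewrite:
    endpoint: https://metrics.example.internal/api/v1/write      # hypothetical metrics backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, loadbalancing]      # metrics are derived before sampling happens
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```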
The Strategic Shift From Head- To Tail-Based Sampling
The shift from head-based to tail-based sampling, in which the sampling decision occurs at the end of the trace, has been a success, Knight said. The teams are now “very happy that they are getting a much better picture now from the traces than before,” he said. This is because tail sampling allows the decision to be made after receiving all the spans and looking at the full trace.
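For reference, here is a minimal sketch of what the sampling layer behind the load balancer could look like, using the tail_sampling processor from opentelemetry-collector-contrib. The decision wait time, memory limits and policies (keep every trace containing an error, plus a probabilistic baseline) are illustrative assumptions; Capital One’s actual policy set was not published.

```yaml
# Sampling layer sketch: hold a trace's spans until the whole trace is
# available, then apply sampling policies to the complete trace.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s            # how long to wait for a trace's spans to arrive
    num_traces: 50000             # traces held in memory while waiting
    policies:
      - name: keep-errors         # always keep traces that contain an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10 # assumed baseline sampling rate

exporters:
  otlphttp:
    endpoint: https://tracing-backend.example.internal   # hypothetical tracing backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp]
```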
Despite the challenges of finding the right balance between high-rate and low-rate applications, the continued focus on dynamically adapting the tail sampling processor is key. The Capital One team intends to publish this research as an open source contribution.
Ongoing Challenges and Future Goals in Data Sampling
That 70% reduction in trace volume might be impressive, but the team is looking at the remaining 30% and asking, “How can we do better?” Knight said.
The key challenge is a “tug of war” between high-frequency (high-rate) and low-frequency (low-rate) events in the probabilistic ratios, he said. High-rate applications can handle a much lower probabilistic rate, whereas low-rate applications get starved at a lower ratio. At scale, tailoring the rule set to each specific application is not feasible.
The current focus is on building enhancements to the tail-sampling processor that will give the system the ability to, as Knight said, “adapt to the frequency of events we see dynamically, right without config changes on our side.”