How Okta Scaled From 12 To 1,000 Kubernetes Clusters With Argo CD

ATLANTA - Let’s just say that Okta’s Auth0 platform customers for private cloud weren’t getting what they likely wanted. Okta’s support was problematic at best, particularly for those customers at scale.

This led to a decision to take a huge bet on open source for GitOps, specifically the CNCF-graduated project Argo CD (part of the Argo project, which also includes Argo Workflows). It wasn’t just a simple lift and shift; instituting it across such a wide scale of operations required over five years. During KubeCon + CloudNativeCon here, Okta engineers Jérémy Albuixech and Kahou Lei detailed their trials and tribulations during their talk, “One Dozen To One Thousand Clusters: How Argo Kept Up as We Scaled.”

The end result: “We can safely say that the results now show they can scale over a 1000 clusters, up from just a dozen or so a few years ago,” Albuixech said during the talk.

.@okta‘s huge bet on open source: During @KubeCon +CloudNativeCon NA in Atlanta, Okta’s Jérémy Albuixech & Kahou Lei detailed the ups and downs during their talk “One Dozen To One Thousand Clusters: How Argo Kept Up as We Scaled.” @thenewstack @linuxfoundation pic.twitter.com/i1EphdFOnZ

— BC Gain (@bcamerongain), Nov. 14, 2025

Credit Due

Before diving into Argo CD and how this came about, it’s worth saying what Argo CD is and what GitOps entails. Argo CD is not just a software tool or platform for scaling Kubernetes clusters; it is also a well-received project with growing community support. It would also be amiss not to mention the parallel CNCF-graduated project Flux.

GitOps operators like Argo CD and Flux monitor Git as the immutable source of truth for the desired state and apply that desired state to the current state. The immutable structure of Git also automates changes to applications and code in clusters when vulnerabilities are discovered during runtime, as they invariably will be. Likewise, if someone were to modify runtimes directly (such as would happen during a security breach), GitOps operators will automatically detect these changes and overwrite them with the desired state in Git.

Flux and Argo CD continuously monitor application definitions and configurations defined in a Git repository and compare the specified state of these configurations with their live state on the cluster. Argo CD reports any configurations that deviate from their specified state. These reports allow administrators to automatically or manually resync configurations to the defined state. Again, Git always serves as the single source of truth.
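
The compare-and-resync idea described above can be illustrated with a minimal Python sketch. It is not how Argo CD or Flux are implemented: the dictionaries standing in for Git and the cluster, and the function names `find_drift` and `reconcile`, are hypothetical, and the sketch assumes that resources missing from Git should be removed from the cluster.

```python
import copy
from typing import Any, Dict, List

Manifest = Dict[str, Any]


def find_drift(desired: Dict[str, Manifest], live: Dict[str, Manifest]) -> List[str]:
    """Return the sorted names of resources whose live state deviates from Git."""
    drifted = [name for name, spec in desired.items() if live.get(name) != spec]
    # Assumption: resources on the cluster that Git does not define also count as drift.
    drifted.extend(name for name in live if name not in desired)
    return sorted(drifted)


def reconcile(desired: Dict[str, Manifest], live: Dict[str, Manifest]) -> List[str]:
    """Overwrite drifted live resources with the desired state; return what changed."""
    drifted = find_drift(desired, live)
    for name in drifted:
        if name in desired:
            live[name] = copy.deepcopy(desired[name])
        else:
            del live[name]
    return drifted


# Hypothetical example data: someone scaled "web" by hand and added a debug pod.
desired_state = {"web": {"image": "web:1.2", "replicas": 3}}
live_state = {
    "web": {"image": "web:1.2", "replicas": 5},
    "debug-pod": {"image": "busybox"},
}
changed = reconcile(desired_state, live_state)
```

After `reconcile` runs, `live_state` equals `desired_state` again, which is the "overwrite with the desired state" behavior the paragraph describes.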

Open Source Truth

Flash back to over five years ago. In its first iteration, Albuixech and Lei described how Okta’s Auth0 platform was primarily a way to host services for customers who wanted their infrastructure and configuration stored in a private cloud account. It was targeted at a very small subset of customers and, as a result, it was not built with high scale and automation as priorities. It had snowflake configurations and infrastructure. Updates were done manually by an operator, access was not as secure as it could be and it relied on early-days cloud infrastructure, basically code running in virtual machines (VMs).

“As demand increased, we needed a new platform design,” Albuixech said. “After research and proofs of concepts, we ended up with a cloud native architecture using the Argo project heavily.” Argo CD handles service provisioning, Argo Workflows handles deployments, Terraform (with a custom provider) handles Infrastructure as Code (IaC), and all of this is orchestrated by control plane services so the team can manage all customer environments, Albuixech said.

Hard Work

One of the beautiful things about open source projects is how community users are constantly proposing changes as needs change and the project itself grows. That said, open source Argo CD exhibits several “significant” challenges, as Albuixech and Lei detailed during their talk.

These include how the auto-sync feature cannot be used in their deployment pipeline because it cannot handle Terraform dependencies or respect customer-specific deployment windows, requiring a custom “auto-sync” using Argo Workflows and the control plane.

As Lei described, Argo CD’s auto-sync ensures the state in the Kubernetes cluster always matches Git. “However, we cannot use auto-sync in our deployment pipeline because of our release model. Each release candidate is a bundle containing service image versions, Terraform code, Kubernetes manifests, plugins and custom logic,” Lei said.

“The homegrown application-X plugin initially caused refresh operations to take minutes because Kustomize spawns a subprocess for each plugin, necessitating a forked binary. Running different plugin versions per customer required a Docker-in-Docker approach, adding further operational complexity,” Lei said.

One release manifest corresponds to one Argo application, and one Argo application represents an entire customer cluster, Lei said. Because infrastructure generated by Terraform affects the configuration and secrets needed by services, Terraform must run before the Argo CD sync. Auto-sync cannot accommodate this dependency, nor can it respect customer-specific deployment windows. The workaround: “So we implement our own ‘auto-sync’ using Argo Workflows plus the control plane,” Lei said.
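
To make the ordering constraint concrete, here is a minimal Python sketch of a gated sync: check the customer's deployment window, run the infrastructure step, and only then sync the application. It is an illustration under stated assumptions, not Okta's implementation. `DeploymentWindow`, `gated_sync` and the two callables are hypothetical names; in a real pipeline the callables would wrap whatever runs Terraform and triggers the Argo CD sync.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Callable


@dataclass(frozen=True)
class DeploymentWindow:
    """Hypothetical daily window during which a customer allows deployments."""
    start: time
    end: time

    def contains(self, moment: datetime) -> bool:
        current = moment.time()
        if self.start <= self.end:
            return self.start <= current < self.end
        # The window wraps past midnight, for example 22:00 to 02:00.
        return current >= self.start or current < self.end


def gated_sync(
    now: datetime,
    window: DeploymentWindow,
    apply_infrastructure: Callable[[], bool],
    sync_application: Callable[[], bool],
) -> str:
    """Run infrastructure first, then the sync, but only inside the window."""
    if not window.contains(now):
        return "deferred"
    # Infrastructure output feeds service configuration and secrets,
    # so the sync must not start unless this step succeeded.
    if not apply_infrastructure():
        return "infrastructure_failed"
    return "synced" if sync_application() else "sync_failed"
```

A caller would invoke `gated_sync` once per release and customer, and retry later whenever it returns `"deferred"`.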

Other challenges included how common transient deployment failures were, including crash loops, sync conflicts, plugin failures and stuck syncs. To manage these, the team built a command line interface (CLI) wrapper that classifies failures, enforces timeouts and controls retries, Albuixech and Lei described. “At scale, the application controller often crashes under load, the UI becomes very slow and application statuses can be misleading, requiring controller scaling, infrastructure improvements, CLI tooling and ignore-resource settings,” Lei said. Upstream bugs, such as a race condition in the application controller and performance issues from untracked resources, had to be addressed internally.
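
The talk did not show the wrapper itself, so the following Python sketch only illustrates the general pattern of classifying failures, enforcing a time budget and controlling retries. The marker strings, `classify` and `run_with_retries` are all hypothetical, and the deadline here is checked between attempts rather than interrupting an operation that is already running.

```python
import time
from typing import Callable, Tuple

# Hypothetical substrings used to recognize failures worth retrying.
TRANSIENT_MARKERS = ("crashloopbackoff", "sync conflict", "plugin failure", "timed out")


def classify(message: str) -> str:
    """Label a failure message as 'transient' (retry) or 'permanent' (give up)."""
    lowered = message.lower()
    return "transient" if any(marker in lowered for marker in TRANSIENT_MARKERS) else "permanent"


def run_with_retries(
    operation: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,
    deadline_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int]:
    """Retry transient failures with backoff; return (outcome, attempts used)."""
    started = clock()
    for attempt in range(1, max_attempts + 1):
        succeeded, message = operation()
        if succeeded:
            return "succeeded", attempt
        if classify(message) == "permanent":
            return "failed_permanent", attempt
        if clock() - started >= deadline_seconds:
            return "timed_out", attempt
        if attempt < max_attempts:
            # Exponential backoff, capped at 30 seconds.
            sleep(min(2 ** attempt, 30))
    return "retries_exhausted", max_attempts
```

Passing `sleep` and `clock` as parameters keeps the sketch easy to test without real waiting; a production wrapper would more likely shell out to a CLI and parse its exit status.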

Workflows with 50 or more steps, maintained by multiple teams, lead to conflicts and require dynamic sub-workflow management. Large workflows for Kubernetes upgrades, Postgres blue-green updates, load tests, chaos tests and CI validation can overwhelm the Argo UI, prompting workarounds like launching labeled child workflows. Some UI limitations forced different solutions, such as a Chrome plugin to remove the “Terminate” button, which bypassed exit hooks and broke automation. Despite these challenges, the platform continues to scale through extensive custom tooling, workflow orchestration and careful operational management, Lei said.

Amazing, Really

A scan of the pull requests and feedback on GitHub for this wildly successful open source project shows that the problems they faced are very common, and that collaborative fixes are on offer. It’s also good to keep in mind that this is a huge success for a major open source project in terms of its implementation. This is not just a very cool GitOps platform technology; Argo CD continues to show its merit for Kubernetes, particularly in GitOps at scale. And not to forget Flux, as it has very proven adoption as well.

Albuixech and Lei are obviously not marketers, but here is how Lei described success: “Despite these challenges, the platform continues to scale through extensive custom tooling, workflow orchestration and careful operational management.”
