Uber, one of the original ride-hailing services, developed a distributed infrastructure before almost anyone even considered it for their enterprise. It ran Mesos before moving to Kubernetes three years ago.
The company is now moving from on premises to multicloud, which has its pros and cons as Uber wrestles with how to optimize GPU usage across multiple cloud providers, juggle workloads and create a cohesive converged infrastructure.
In a talk at the co-located AI Day at KubeCon + CloudNativeCon North America in Atlanta this month, Andrew Leung, a senior staff engineer at Uber, offered a glimpse into what it’s like for a company that runs AI workloads to move from on premises to a multicloud approach.
The Uber story shows the divide between data and compute in enterprise architectures, with the fungibility of GPUs as a primary challenge. How companies adapt to using AI models will depend on their use cases and which cloud providers they use. For Uber, it now means managing multiple cloud providers to optimize workloads, which creates its own trade-offs.
The Challenges of Separating Data and Compute
Uber has used predictive models since 2018, Leung said. It now uses AI models across its enterprise for a car’s estimated time of arrival, pricing, fraud detection and the Uber Eats ranking feed. The company now uses large language models (LLMs) for customer-facing and internal tools. It has started to use AI applications for merchant storefronts and agentic systems for internal workflows.
Uber uses cloud service providers for different use cases, but separating data and compute has affected its internal infrastructure. How its teams have assembled their Kubernetes stacks reflects how their data and compute are separated, which allows them to optimize for each cloud but makes it challenging to build out converged infrastructure.
The engineering teams, Leung said, maintain a data lake running on a single cloud provider. They use a separate cloud provider for inference and other microservices. The next step: bridge the divide between the clouds they use.
That in itself poses a challenge even when it makes business sense, particularly for its GPUs and GPU capacity. Even for Uber, GPUs are scarce and expensive. When spread across multiple cloud services, it becomes challenging to leverage GPUs to their full potential.
Working with GPUs is not exactly a seamless cloud native experience compared to managing CPUs, Leung added, where portability is more straightforward.
“And so we end up having to think about use cases that can either live entirely within one cloud provider so that I can put training and serving together, or I need to think about the use cases where it makes sense to actually pull the data from one provider to another, in order to facilitate being able to leverage that compute,” Leung said.
“It doesn’t make it quite as seamless as it could be, and you have to be purposeful in how you think about what workloads you’re going to be converging together.”
And as for capacity? Silos and over-indexing present their own set of issues.
“We ended up with a Kubernetes infrastructure focused on batch and a Kubernetes infrastructure focused on microservices,” Leung said. “The distinction has been that the hardware was segregated at the cluster level, so we would have dedicated GPU clusters that were just serving GPU workloads, and a number of CPU-based clusters serving CPU workloads.
“But that’s led to siloing of the actual capacity and ends up kind of over-indexing on a Kubernetes cluster as an abstraction for hardware, rather than leveraging a lot of what we can do internally from Kubernetes itself.”
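One reading of the alternative Leung alludes to is scheduling GPU work onto GPU nodes inside a shared cluster, rather than treating each cluster as a hardware silo. Here is a minimal sketch using the Kubernetes Python client; the node label, image and namespace are hypothetical stand-ins, not Uber’s actual setup:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (in-cluster config also works).
config.load_kube_config()

# A pod that targets GPU nodes within a shared cluster via a node selector
# and an explicit GPU resource limit, instead of a dedicated GPU cluster.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),  # hypothetical name
    spec=client.V1PodSpec(
        node_selector={"accelerator": "nvidia-gpu"},  # hypothetical node label
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # request exactly one GPU
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

With this pattern, CPU and GPU capacity live in the same cluster and the scheduler, not the cluster boundary, decides where work lands.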
Disaster Recovery Overhead
Disaster recovery requires more overhead when running AI workloads, Leung said, in response to Madhuri Yechuri, CEO of Elotl, who interviewed Leung at the co-located AI Day session. Again, the challenge of using GPUs comes into play. The patterns with CPUs don’t apply.
“That is increasingly complex for GPU infrastructure, given the cost and scarcity of it,” Leung said. “That’s much harder to stomach when you see how much extra overhead you need to carry for GPU and also, given the fact that GPU workloads aren’t quite as fungible as CPU workloads, where I can’t as easily just dynamically pack eight workloads onto one GPU now, where I could have just squeezed things onto a single CPU.”
Failovers are also problematic. You can’t just move GPU workloads around like CPU workloads.
“This thing is optimized for this particular hardware, with this particular configuration,” Leung said. “If I do a failover, I can’t necessarily easily just move it to a different hardware configuration on the fly.”
The Future of Agentic Workflows
Agentic workflows are a different matter for Uber. The company is building and fine-tuning existing models through several initiatives to create agentic systems for its internal tools and to provide LLM-based support across internal systems. But agentic work still represents the minority of GPU usage.
“But as we increase investment there, it has the potential to become larger scale,” Leung said. “The predictive models that we’ve been training aren’t likely to double in size, for example, because they mostly grow along with business growth.”
There may come a day when Uber unlocks some agentic workflow and puts it everywhere, he said, which would represent a sizable increase in what they need to support with GPUs.
But what about utilization? It’s a matter Uber is reviewing.
“A lot of our investments in those kinds of agentic workflows are often experimental, but still require a nontrivial footprint,” Leung said. “And as an experimental product, they’re not in production necessarily. So they’re not taking constant, always-on, high volumes of traffic, which makes it questionable as to whether or not, in the end, GPUs being provisioned for this thing is really the best use of them.”
Uber uses Ray almost exclusively for training. Ray is a unified framework for scaling AI and Python applications. The infrastructure runs across the company’s regular batch infrastructure, which also runs Spark and other internal batch processing workloads.
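For readers unfamiliar with Ray, here is a minimal sketch of a distributed training job using Ray Train’s TorchTrainer; the worker count and reported metric are placeholders, and this is not Uber’s internal code:

```python
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # A real loop would build a model, shard data and train here;
    # each worker reports its metrics back to the Ray driver.
    train.report({"loss": 0.0})  # placeholder metric

# Scale the same training loop across workers, each pinned to a GPU.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
print(result.metrics)
```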
A typical AI workflow consists of several steps for data preprocessing and ETL-like tasks. Those could be running on Spark if they’re ETL-focused, with most of the training workloads powered via Ray. Nvidia Triton powers predictive models as the adoption of vLLM increases for LLM use cases. Models get optimized through TensorRT for specific hardware, then run through the ONNX runtime.
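One way to read that last step is ONNX Runtime serving a model with a TensorRT-backed execution provider, falling back to CUDA or CPU when TensorRT is unavailable. A minimal sketch, with a hypothetical model file and input shape:

```python
import numpy as np
import onnxruntime as ort

# Prefer the TensorRT execution provider for hardware-specific optimization,
# falling back to CUDA, then CPU, if it is unavailable.
session = ort.InferenceSession(
    "model.onnx",  # hypothetical exported model
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 16).astype(np.float32)  # hypothetical shape
outputs = session.run(None, {input_name: dummy_input})
```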
An Unsolved Problem: Making GPUs More Fungible
For Uber, making GPUs more fungible remains an unsolved problem, and solving it starts with addressing how its clusters are set up, Leung said.
“Different GPU types with varying memory configurations and architectures can’t seamlessly substitute for each other,” Leung said. “Today, with how we have clusters set up, the cluster is the boundary. And so we’re moving away from the cluster as kind of its own silo there to help make that more fungible as the first step.”
The Uber teams ran authentication and other services on underutilized GPUs. But by moving those services to CPU clusters, they improved performance and freed up GPU capacity.
The issues with GPUs also make observability a challenge. Uber uses Nvidia but is also considering AMD hardware. Each vendor has different metrics.
Adapting to New Metrics
Using AI early has left Uber with accumulated technical debt. Its teams have revamped their stack, leveled up and modernized. They recently migrated their old GPU metrics based on cAdvisor, which did not support newer models.
Leung said Uber’s engineers have had to adapt to the GPU and the differences in metrics.
“What had happened is we advertised these low-level metrics, which other teams then began building their dashboards and metrics and systems around,” Leung said. “And then, when we think about, OK, we want to actually migrate this to a different metric set. Well, the whole company is pinned to this small set of metrics.”
They’re exploring building their own API.
“You’re going to end up with a combination of a variety of different metrics and with nuances about what each of them means and how teams should understand it when they’re trying to think about high-level utilization or cost efficiency or memory usage,” Leung said.
“And so what we’ve been doing there is trying to build metrics almost as an API where we have specific platform metrics that we’re exposing, which we can then potentially source from a variety of vendor-specific metrics. It doesn’t require the user to be as deeply entrenched in any one specific vendor’s metrics model.”
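A sketch of that idea: a thin translation layer that maps vendor-specific GPU metric names onto a stable set of platform metrics. The Nvidia names below are real DCGM field names; the AMD names and the platform-level names are hypothetical stand-ins, not Uber’s actual API:

```python
# Map vendor-specific metric names to platform-level names (hypothetical).
VENDOR_METRIC_MAP = {
    "nvidia": {
        "DCGM_FI_DEV_GPU_UTIL": "gpu.utilization_percent",
        "DCGM_FI_DEV_FB_USED": "gpu.memory_used_mib",
    },
    "amd": {  # hypothetical AMD exporter names
        "gpu_busy_percent": "gpu.utilization_percent",
        "vram_used_mib": "gpu.memory_used_mib",
    },
}

def to_platform_metrics(vendor: str, raw: dict[str, float]) -> dict[str, float]:
    """Translate raw vendor metrics into the platform's metric names,
    dropping anything the platform does not expose."""
    mapping = VENDOR_METRIC_MAP[vendor]
    return {mapping[name]: value for name, value in raw.items() if name in mapping}

# The same dashboard query then works regardless of the GPU vendor.
print(to_platform_metrics("nvidia", {"DCGM_FI_DEV_GPU_UTIL": 87.0}))
```

Consumers of the platform metrics never see the vendor names, so swapping or mixing hardware vendors does not break downstream dashboards.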