How We Helped Retell AI Elevate Its Generative AI Workloads

Here's how we worked with Retell AI to make it a generative AI leader in AI voice.

The Problem

Retell AI was scaling its next-generation conversational AI stack on AWS using a heterogeneous fleet of GPU-accelerated EC2 instances, primarily G5/G6 instances (A10G and L4 GPUs) alongside H100-based capacity, to support continuous ASR/LLM fine-tuning and ultra-low-latency real-time inference. As customer adoption grew, Retell's workloads became increasingly unpredictable: training cycles varied dramatically in duration and GPU intensity, while inference traffic spiked with customer call patterns. This volatility made it nearly impossible to forecast a stable GPU baseline, resulting in both idle GPU periods and sudden capacity shortages that jeopardized model iteration speed and SLA reliability.

At the same time, Retell's rapidly evolving Gen AI architecture created substantial financial exposure. Choosing the right compute footprint across multiple GPU families was technically complex, and AWS-recommended Savings Plans carried real risk: a shift in model architectures, memory requirements, or instance families could strand commitments and inflate cost of goods sold. Without a way to de-risk long-term EC2 commitments, Retell faced rising month-over-month spend, reduced experimentation velocity, and infrastructure costs scaling faster than product revenue.

The Solution

Pump conducted a full GPU consumption and Savings Plan attribution analysis across Retell's heterogeneous fleet, focusing on the G5, G6, and G6e instance families, where coverage gaps were creating the highest effective cost. Using AWS Cost Explorer, Compute Savings Plans, and AWS Cost & Usage Report (CUR) data, Pump built a GPU-specific baseline model that distinguished steady-state inference workloads from volatile fine-tuning cycles. This allowed Pump to identify a reliable, commit-worthy core footprint on G6e, Retell's largest and fastest-growing cost center despite its high surface-level volatility.

Architecturally, the solution integrated cleanly into Retell's existing AWS environment. Pump guided Retell in aligning its GPU workloads to Savings Plan-eligible G6e, G6, and G5 usage, while allowing high-variance training workloads to remain On-Demand or scale elastically using native AWS services such as Amazon EC2 Auto Scaling and Amazon EKS for containerized GPU jobs. No changes were required to Retell's inference stack, ASR/LLM pipelines, or model deployment flow; instead, Pump focused on optimizing cost attribution, commitment structure, and baseline right-sizing.

Pump then deployed its automated underwriting platform to insure the Savings Plan commitments, enabling Retell to confidently make multi-year commitments on the GPU instance families it truly depended on without assuming the financial risk of over-commitment. As a result, Retell immediately closed large coverage gaps, especially on G6e, where coverage was previously only 28%, reducing effective GPU cost, stabilizing monthly spend, and unlocking predictable economics for future ASR/LLM scaling.
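As a rough illustration of the attribution step, the sketch below pulls daily usage and cost for the G5, G6, and G6e families from the Cost Explorer API. This is a minimal sketch assuming standard boto3 access; the date window, metrics, and grouping are illustrative choices, not Pump's actual pipeline.

```python
import boto3

# Minimal sketch of per-family GPU usage attribution via Cost Explorer.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},  # illustrative window
    Granularity="DAILY",
    Metrics=["UsageQuantity", "UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "INSTANCE_TYPE_FAMILY",
            "Values": ["g5", "g6", "g6e"],  # the GPU families in scope
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE_FAMILY"}],
)

# Each ResultsByTime entry carries per-family daily usage and cost: the raw
# series a commit-worthy baseline model would be fit against.
for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        family = group["Keys"][0]
        hours = group["Metrics"]["UsageQuantity"]["Amount"]
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], family, hours, cost)
```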

Success

We worked with Retell to save hundreds of thousands of dollars annually by executing a three-year Savings Plan with a $27.82-per-hour commitment. On top of that, with expert solutions architects providing recommendations and guidance on building the most optimal setup, Retell was set up for success.
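For a sense of the commitment's scale, here is a back-of-envelope calculation assuming the quoted $27.82/hour rate runs around the clock; the realized savings depend on the Savings Plan discount versus On-Demand pricing, which is not quoted here.

```python
# Back-of-envelope scale of the three-year commitment.
hourly_commitment = 27.82           # USD per hour, from the executed plan
hours_per_year = 24 * 365           # 8,760 hours

annual_commitment = hourly_commitment * hours_per_year
total_commitment = annual_commitment * 3  # three-year term

print(f"Annual committed spend: ${annual_commitment:,.0f}")     # ~$243,703
print(f"Three-year committed spend: ${total_commitment:,.0f}")  # ~$731,110
```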

Analysis

Pump completed a detailed Total Cost of Ownership (TCO) analysis focused specifically on Retell AI's GPU-driven Gen AI workloads. The analysis began by extracting instance-level usage patterns from the AWS Cost and Usage Report and isolating steady-state inference demand from variable training and fine-tuning cycles across the G5, G6, and G6e instance families. Pump calculated real GPU-hour consumption, utilization efficiency, and workload elasticity to identify the minimum stable compute baseline that Retell could confidently run on AWS for the next one to three years. This required modeling GPU memory profiles, throughput characteristics, and concurrency patterns to determine which workloads were portable and which required a fixed architecture on current-generation accelerators.
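To make the baseline idea concrete, here is a minimal sketch assuming hourly GPU-hour usage has already been extracted from the CUR into a pandas DataFrame; the percentile threshold and the synthetic data are illustrative choices, not Pump's published methodology.

```python
import numpy as np
import pandas as pd

# Sketch of separating steady-state inference demand from fine-tuning bursts.
rng = np.random.default_rng(0)
hours = pd.date_range("2024-01-01", periods=24 * 90, freq="h")

# Hypothetical G6e profile: a steady 40 GPU-hour inference floor, plus
# occasional 200 GPU-hour fine-tuning spikes on ~5% of hours.
spikes = (rng.random(len(hours)) < 0.05) * 200
usage = pd.DataFrame({"g6e": 40 + spikes}, index=hours)

# A low quantile approximates the floor the fleet rarely drops below,
# i.e. the portion of usage that is safe to commit to.
baseline = usage.quantile(0.10)
print(baseline)  # g6e ~= 40 GPU-hours/hour of commit-worthy core
```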

Pump then generated forward-looking TCO projections that compared multiple commitment strategies, including one-year and three-year Compute Savings Plans, as well as potential migration into higher-performance instance families as Retell scaled. These models quantified the incremental AWS consumption Retell would drive as it grew inference concurrency and onboarded new customers. Pump also evaluated how long-term commitments would support net-new GPU growth on AWS by ensuring Retell could adopt more powerful instance families such as G6e at scale without financial risk. The final TCO output demonstrated that a multi-year Savings Plan aligned with Retell's growth trajectory, increased AWS consumption predictability, and provided a clear economic pathway for Retell to expand its Gen AI workloads on AWS with confidence.
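A simplified version of that strategy comparison might look like the following sketch, with hypothetical discount rates and an hours-based simplification (real Savings Plans commit a dollar amount per hour, and actual discounts vary by instance family, region, term, and payment option).

```python
# Simplified comparison of commitment strategies with hypothetical discounts.
def blended_hourly_cost(on_demand_rate: float, usage_hours: float,
                        committed_hours: float, sp_discount: float) -> float:
    """Blend discounted committed capacity with On-Demand overflow."""
    sp_rate = on_demand_rate * (1 - sp_discount)
    covered = min(usage_hours, committed_hours)
    overflow = max(usage_hours - committed_hours, 0.0)
    # Unused commitment is still billed: the over-commit risk that
    # Pump's underwriting is described as insuring against.
    unused = max(committed_hours - usage_hours, 0.0)
    return (covered + unused) * sp_rate + overflow * on_demand_rate

ON_DEMAND = 4.50   # hypothetical G6e On-Demand $/hour
BASELINE = 40.0    # commit-worthy GPU-hours per hour, from the baseline model

for term, discount in [("1-year", 0.20), ("3-year", 0.40)]:  # illustrative rates
    cost = blended_hourly_cost(ON_DEMAND, usage_hours=55.0,
                               committed_hours=BASELINE, sp_discount=discount)
    print(term, f"-> ${cost:,.2f}/hour blended")
```

Under assumptions like these, the deeper three-year discount wins as long as the committed baseline stays utilized, which is exactly the over-commitment risk the underwriting step removes.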
