Baseten CEO Tuhin Srivastava on Custom Models and Building the Inference Cloud
Tuhin Srivastava (Baseten CEO), speaking from inside the inference cloud just as the rest of the industry's narrative turns toward it. **30x revenue growth in 12 months, on track for >$1B in 2026.** 95% of tokens served are *custom* models (post-trained variants of open-source). Operates at mid-90s utilisation across **90 clusters in 18 clouds** and runs a daily 4pm capacity-allocation meeting. **GB200 access now requires 3-5 year contracts with 20-30% TCV prepay**, materially changing the IPO/financing calculus for inference companies. The H100 is still in demand 4.5 years post-launch, with its price still rising. Frontier open-source is now overwhelmingly Chinese (DeepSeek, Moonshot/Kimi, Canopy, Orpheus); 'effectively the Chinese government is subsidising US enterprise.' 400% NDR; none of the top-30 customers has ever churned. Confirms Reiner Pope's framing that disentangling pre-fill and decode is 'the next set of primitives.' On Jevons: 'inference is the last market — even if there's AGI, all that's left is inference.'
Key points
- **30x revenue growth in 12 months, >$1B run-rate trajectory in 2026, 400% NDR, zero churn among top-30 customers.** The inference cloud category is real, and Baseten is a clean read on its growth rate. The AI 'long tail' (customers bringing intelligence in-house) plus post-training is now mainstream enough to be the default pattern.
- **95% of tokens served are custom (post-trained) models.** Almost no one is running vanilla open-source weights at scale. Baseten acquired the Parsed research team to support post-training because 'inference and post-training are two sides of the same problem' — inference begets evals, evals beget reward signals, reward signals beget more post-training, more post-training begets more inference.
- **Open-source frontier is now Chinese.** Customer-favoured open models cited by name: DeepSeek, Moonshot/Kimi, Canopy, Orpheus (text-to-speech). 'It would be a fundamental problem if America never came up with good open-source models.' Sarah Wang's framing accepted: 'effectively the Chinese government is subsidising US enterprise' via these freely-available models. **Direct point of friction with Sacks's 'sovereign-AI' framing on All-In.**
- **Capacity is structurally constrained — and the way it's constrained has changed.** Baseten operates at mid-90s utilisation across **90 clusters in 18 clouds**. Daily 4pm standing capacity-allocation meeting. New GB200 capacity now requires **3-5 year contracts with 20-30% TCV prepay**. 'What becomes important when acquiring capacity is having low cost of capital' — direct push toward earlier IPO for inference plays.
- **H100 is still appreciating in the secondary market 4.5 years post-launch** despite Blackwell + Rubin coming. Useful life now estimated at 9 years. Direct cross-reference to PTJ's leverage / capacity-allocation thesis — these are very long-duration capital commitments.
- **'Probably 12 good clouds, 3-4 in the gold tier.'** A lot of new GPU suppliers are 'grifty' — haven't run data centres before, don't understand SLAs especially for inference. Even when capacity is nominally available, operational diligence kills it. Multi-cloud inference fabric (Baseten's tech) becomes the only way to avoid being held hostage by individual provider failures.
- **Disentangling pre-fill and decode.** Direct echo of Reiner Pope's Dwarkesh episode: pre-fill is compute-bound, decode is memory-bandwidth-bound, and the next set of inference primitives treats them as separate problems. KV-cache-aware routing, speculation techniques, dedicated decode chips — all on Baseten's roadmap.
- **'GPUs as a service is not sticky. Inference + software layer is incredibly sticky.'** None of the top-30 customers have ever churned. The strategic lesson for the labs: in a compute-constrained world, the labs are vertically integrating (owning the inference cloud). 'In a world of constrained compute, the number one thing to own is compute.'
- **Customer pattern: capability first, cost second.** Customers come in for the highest-quality model and then optimise. 'No GPUs pre product-market-fit; no post-training pre product-market-fit.' Once an application has shown user-signal value, post-train a specialised model that's better-faster-cheaper for that specific job (e.g. customer support model that doesn't need to be good at coding).
- **The company ran lean-org until 12-18 months ago** — Sarah Wang told Tuhin he 'just needed leaders.' Hero culture is explicitly banned. The explicit hiring rubric: first-principles, kind, low-ego, can operate without a manager. **Fourth lean-ops case in this issue (with AppLovin, the 20VC framing, Kalshi).** But unlike those, Baseten has accepted that infrastructure scale eventually requires a leadership layer.
- **Pager culture is the operations DNA.** Co-founder Amir's 7-year-old asks 'is that a P0?' when his pager goes off. Senior AWS execs' pagers all went off during a 45-min meeting — 'it's a cultural thing, you just have to get used to it.' Self-selecting filter for who can build infrastructure companies.
- **Jevons Paradox confirmed in customer behaviour.** 'When inference cost drops, agents just run longer or do more work to get to a larger end.' Demand for compute scales on the inference side too. Tuhin: **'inference is the last market — even if there's AGI, all that's left is inference.'** This is the operator-side mirror of Reiner Pope's overtraining-vs-Chinchilla math from the Dwarkesh episode.
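The pre-fill/decode point falls out of simple roofline arithmetic. A minimal sketch (illustrative figures and a hypothetical 70B dense model, not Baseten's internals): pre-fill pushes thousands of tokens through the weights per pass, decode pushes one, so their FLOPs-per-byte land on opposite sides of the hardware's ridge point.

```python
def arithmetic_intensity(tokens_per_pass: int, params: float) -> float:
    """FLOPs per byte for one forward pass over a dense transformer.

    Rough model: ~2 * params FLOPs per token, and every pass must
    stream all weights (2 bytes each in fp16) from HBM at least once.
    KV-cache traffic is ignored for simplicity.
    """
    flops = 2 * params * tokens_per_pass
    bytes_moved = 2 * params  # weight reads dominate in this rough model
    return flops / bytes_moved

PARAMS = 70e9  # assumed 70B-parameter dense model

# Pre-fill: a 4096-token prompt processed in one batched pass.
prefill_ai = arithmetic_intensity(tokens_per_pass=4096, params=PARAMS)

# Decode: one new token per pass (single sequence, no batching).
decode_ai = arithmetic_intensity(tokens_per_pass=1, params=PARAMS)

# H100 SXM: ~989 fp16 TFLOP/s vs ~3.35 TB/s HBM3, a ridge point of
# ~295 FLOPs/byte. Below the ridge, the pass is bandwidth-bound.
RIDGE = 989e12 / 3.35e12

print(f"prefill intensity ~ {prefill_ai:.0f} FLOPs/byte (ridge ~ {RIDGE:.0f})")
print(f"decode  intensity ~ {decode_ai:.0f} FLOPs/byte")
```

Under this toy model the intensity equals the tokens per pass: ~4096 FLOPs/byte for pre-fill (well above the ridge, compute-bound) versus ~1 for unbatched decode (far below it, memory-bandwidth-bound), which is why KV-cache-aware routing, speculation, and dedicated decode hardware treat the two phases as separate problems.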
Notable quotes
30x growth in the last 12 months. None of our top 30 customers have ever churned. We're talking 400% net dollar retention.
GPUs as a service is not sticky. Inference with the software layer included is incredibly sticky. In a world of constrained compute, the number one thing to own is compute.
Effectively the Chinese government is subsidising US enterprise via these open-source models. If we don't have access to that intelligence, we won't be able to innovate as fast.
If you want a B200 right now from a good cloud, you're not getting that less than a three-to-five-year contract with a 20-30% TCV prepay. Cost of capital is everything.
Inference is the last market. Even if there's AGI, all that's left is inference.
Capacity. That's what keeps me up at night. There's no world in which there's enough compute to get the value we want out of LLMs in the next five to ten years.
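The contract terms quoted above imply serious upfront capital, which is the "cost of capital" point. A hedged back-of-envelope (all figures hypothetical, not from the episode):

```python
def prepay_due(gpus: int, price_per_gpu_hour: float, years: float,
               prepay_frac: float) -> tuple[float, float]:
    """Total contract value and upfront prepay for a reserved GPU block."""
    hours = years * 365 * 24
    tcv = gpus * price_per_gpu_hour * hours
    return tcv, tcv * prepay_frac

# Hypothetical reservation: 1,024 GPUs at $5/GPU-hour, 4-year term, 25% prepay.
tcv, upfront = prepay_due(gpus=1024, price_per_gpu_hour=5.0,
                          years=4, prepay_frac=0.25)
print(f"TCV ~ ${tcv/1e6:.0f}M, prepay due up front ~ ${upfront/1e6:.0f}M")
# -> TCV ~ $179M, prepay due up front ~ $45M
```

Even a modest 1,024-GPU block at these assumed rates means tens of millions wired before the first token is served, which is why low cost of capital, and hence an earlier IPO, becomes the binding constraint on capacity acquisition.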
Themes
- Inference cloud category growth at 30x with structural compute constraint
- 95% of served tokens are custom post-trained models
- Cost of capital is now the binding constraint on capacity acquisition
- Frontier open-source has shifted to Chinese labs
- Inference + software stickiness vs commoditised GPU-as-a-service
Mentioned
Ideas
- Baseten 30x growth in 12 months
- >$1B 2026 revenue trajectory
- 95% custom-model token share
- Inference + post-training are two sides of same problem
- Frontier open-source is Chinese-led
- Chinese subsidy of US enterprise via open-source models
- GB200 3-5 year contract + 20-30% TCV prepay
- H100 price still rising 4.5 years post-launch
- 9-year GPU useful life
- 12 good clouds, 3-4 in gold tier
- Multi-cloud inference fabric (90 clusters / 18 clouds)
- Pre-fill / decode disentanglement as next primitive
- Inference + software = stickiness; GPU-only = commodity
- Capability-first then cost optimisation customer pattern
- Hero-culture-banned lean-ops hiring rubric
- Pager-culture as infrastructure DNA
- Jevons paradox in inference (longer agent runs)
- 'Inference is the last market — even if AGI'