How GPT-5, Claude, and Gemini are actually trained and served — Reiner Pope
Blackboard lecture from Reiner Pope (CEO of MatX, ex-Google TPU architecture) — the technical *underpinning* for why everything in the 20VC + All-In episodes is true. Roofline analysis (memory bandwidth vs compute) explains: why the optimal inference batch is ~300 × sparsity (~2-3k tokens), why coding-agent 'fast modes' charge 6x for 2.5x speed, why Gemini's pricing jumps 50% at 200k context, why output tokens cost 5x input tokens, why frontier models effectively run inside a single rack, why scale-up domain size (NVL72 → Rubin 500+) is the actual GPU-generation unlock rather than raw FLOPS, and why **memory bandwidth — not compute, not even power — is the deepest bottleneck.** Per Dylan Patel (cited mid-episode), ~50% of 2026 hyperscaler CapEx is going on memory. Models are ~100x overtrained vs Chinchilla because inference token volume across a model's ~2-month life exceeds its training tokens.
Key points
- **The two-equation roofline:** inference time = max(compute_time, memory_time), where compute_time = batch × active_params / FLOPS and memory_time = (total_params + batch × context × bytes_per_token) / mem_bandwidth. Below the crossover batch you're memory-bound; above it you're compute-bound. Every observable AI-pricing fact drops out of this single framework (see the sketch after this list).
- **Optimal batch ≈ 300 × sparsity (~2-3k tokens).** For DeepSeek (32 of 256 experts active, sparsity factor 8): 8 × 300 = 2,400. The 'train' analogy: batches depart every ~20ms (the HBM drain time), and late passengers wait for the next train, so worst-case latency is 2 × 20ms = 40ms — a full wait plus a full ride. To be 'competitive at scale' you need ~128k tokens/sec served = 1/1000 of Gemini.
- **Why coding-agent fast-mode is 6x price for 2.5x speed:** smaller batches mean the weight-fetch cost isn't amortised over enough tokens. The cost floor sits at the batch where compute time equals memory time; below that batch, cost per token climbs steeply no matter what you charge. Conversely, there's no point offering a 'slow mode' beyond the crossover — once compute-bound, a bigger batch saves nothing (the sketch below makes this floor concrete).
- **Mixture-of-experts forces single-rack residency.** All-to-all communication pattern; scale-out fabric is 8x slower than NVLink scale-up. Half the traffic crossing rack boundaries kills throughput. This is *the* reason NVL72 (and Rubin's 500+ scale-up) matters more than per-GPU FLOPS gains.
- **Pipelining works for >1 rack but doesn't help KV cache.** Pipelining spreads the weights across racks, but each added stage needs more sequences in flight to keep the racks busy, so the aggregate KV cache grows right back. Translation: you can (eventually) scale parameters across racks; you cannot scale long context the same way. Context length is HBM-bandwidth-bound, and that's the deepest wall.
- **Gemini 3.1's 50% price jump at 200k context** = the empirical inflection point where memory time crosses compute time. Reverse-engineered from API pricing, it implies ~2KB of KV cache per token — plausible with 8 KV heads and d_head = 128. **API pricing leaks architecture** (a back-of-envelope version follows this list).
- **Output is 5x more expensive than input** because prefill is compute-bound (parallel, large effective batch) while decode is memory-bandwidth-bound (one token at a time, fetching the whole KV cache at every step). A direct read on how memory-constrained each model is.
- **Cache-hit pricing reveals the memory tier:** a 5-minute window discount ≈ DDR drain time; a 1-hour window ≈ flash or even spinning disk. Frontier providers are literally caching contexts to spinning rust, because the economics work at the right hold time.
- **Models are ~100x overtrained vs Chinchilla.** Reiner equates cost(pretrain) ≈ cost(RL) ≈ cost(inference). Working backwards from ~50M tokens/sec served over a ~2-month lifetime and ~100B active params: pretrain lands at ~150-200T tokens, where Chinchilla-optimal would be ~2T (arithmetic worked through below). **Inference >> training in lifetime cost** — the entire 'sum of human knowledge' in tokens gets re-emitted by every served model. This is the real reason for the serving-cost obsession.
- **Sparsity has sub-linear quality scaling.** 4x more experts buys ~4x effective parameters in quality terms (per the unified routed-LM scaling laws), and quality itself grows sub-linearly in those parameters. Profitable only if the memory capacity exists to host the inactive experts — which is why DeepSeek-style fine-grained sparse MoE wins.
- **Why context length stalled at ~200k for two years:** there is no solution to the HBM memory wall. Sparse attention's sqrt-scaling helps but eventually loses quality. The 'continual learning isn't needed if context is long enough' thesis (Dario) requires 100M-token contexts that current memory tech cannot support. **A real ceiling on the agent-as-employee narrative until HBM scales materially or attention becomes fundamentally sparser.**
- **RevNets / Feistel-cipher trick:** import the invertibility construction from cryptography to skip storing activations during training — rematerialise them on the backward pass (minimal sketch below). Trades compute for memory, and tells you how desperate the field is for memory headroom.
- **Scale-up domain size is the GPU-generation story.** Hopper 8 → Blackwell 72 → Rubin 500+. The headline FLOPS gains matter less than the aggregate memory bandwidth available in parallel for weight loads, which scales with the scale-up domain. Explains why Gemini was ahead on long context for a year: Google's TPU pods had bigger scale-up domains earlier.
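A minimal sketch of the two-equation roofline, in Python. Every hardware and model number below is an illustrative assumption (DeepSeek-like sparsity, an aggregate ops:byte ratio near 300, fp8 weights), not a figure from the lecture — the point is the shape of the curve, not the constants.

```python
# Toy roofline for MoE decode. All numbers are assumptions for illustration.
TOTAL_PARAMS       = 671e9             # DeepSeek-scale total parameters (assumed)
ACTIVE_PARAMS      = TOTAL_PARAMS / 8  # 32 of 256 experts -> sparsity factor 8
FLOPS              = 8e15              # aggregate FLOP/s of the serving domain (assumed)
MEM_BW             = 26.4e12           # aggregate HBM bandwidth, bytes/s (assumed)
BYTES_PER_PARAM    = 1                 # fp8 weights (assumed)
KV_BYTES_PER_TOKEN = 2048              # the ~2KB/token figure from the pricing bullet
CONTEXT            = 8192              # tokens of context per sequence (assumed)

def decode_step_time(batch):
    """inference time = max(compute_time, memory_time), per the roofline."""
    compute = batch * ACTIVE_PARAMS / FLOPS
    memory = (TOTAL_PARAMS * BYTES_PER_PARAM
              + batch * CONTEXT * KV_BYTES_PER_TOKEN) / MEM_BW
    return max(compute, memory), compute, memory

# Crossover batch, ignoring the KV term: where compute time = weight-load time.
# This reduces to sparsity * (FLOPS / MEM_BW) * bytes_per_param ~ 8 * 300 = 2,400.
crossover = (TOTAL_PARAMS * BYTES_PER_PARAM / MEM_BW) / (ACTIVE_PARAMS / FLOPS)
print(f"crossover batch ~ {crossover:,.0f} tokens")

for batch in (1, 64, 2_400, 8_192):
    t, c, m = decode_step_time(batch)
    regime = "memory-bound" if m >= c else "compute-bound"
    print(f"batch {batch:>5}: {t*1e3:5.1f} ms/step, "
          f"{t/batch*1e6:8.1f} us/token  ({regime})")
```

With these assumptions the weight-load time comes out at ~25ms per step (the ~20ms 'train departure'), batch 1 costs a few thousand times more per token than the crossover batch (the 'thousand times worse' quote), and pushing past the crossover into the compute-bound regime barely moves cost per token — the fast-mode/slow-mode floor from the bullets above.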
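The 200k inflection and the 'API pricing leaks architecture' claim can be sanity-checked with the same roofline terms. The batch size and byte counts below are assumptions carried over from the sketch above; the point is the method, not reproducing the lecture's exact ~2KB.

```python
# Where does the KV term in memory_time overtake the weight term?
TOTAL_PARAM_BYTES  = 671e9   # fp8 weights (assumed, as above)
KV_BYTES_PER_TOKEN = 2048    # ~2KB/token, the reverse-engineered figure
BATCH              = 2400    # serving near the crossover batch (assumed)

# memory_time ~ TOTAL_PARAM_BYTES + BATCH * context * KV_BYTES_PER_TOKEN,
# so KV traffic matches weight traffic at:
ctx_star = TOTAL_PARAM_BYTES / (BATCH * KV_BYTES_PER_TOKEN)
print(f"KV = weights at ~{ctx_star/1e3:.0f}k context")  # ~137k with these assumptions

# Reverse direction ("pricing leaks architecture"): observe the inflection
# in the price sheet, solve for bytes of KV cache per token.
observed_inflection = 200_000
kv_inferred = TOTAL_PARAM_BYTES / (BATCH * observed_inflection)
print(f"implied KV cache ~ {kv_inferred:.0f} bytes/token")  # ~1.4KB here
```

Under these assumed numbers the implied figure lands in the same ballpark as the ~2KB/token quoted in the bullet — which is the whole trick: a public price break plus a roofline gives you a hardware-level architecture estimate.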
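The overtraining arithmetic, worked through with the episode's round numbers (everything here is order-of-magnitude):

```python
# Lifetime inference tokens vs Chinchilla-optimal pretraining tokens.
tokens_per_sec = 50e6            # ~50M tokens/sec served (from the episode)
lifetime_sec   = 60 * 24 * 3600  # ~2-month frontier-model lifetime
inference_tokens = tokens_per_sec * lifetime_sec
print(f"lifetime inference tokens ~ {inference_tokens/1e12:.0f}T")  # ~260T

active_params = 100e9
chinchilla_tokens = 20 * active_params  # ~20 tokens/param rule of thumb
print(f"Chinchilla-optimal ~ {chinchilla_tokens/1e12:.0f}T tokens")  # ~2T

# Cost-equilibrium (pretrain ~ inference) pushes pretraining toward the
# inference-token scale: ~150-200T in the lecture.
print(f"overtraining factor ~ {175e12 / chinchilla_tokens:.0f}x")    # ~90x, i.e. ~100x
```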
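The RevNet / Feistel coupling is simple enough to show directly. `F` and `G` below are hypothetical stand-ins for arbitrary (non-invertible) sub-networks: because each half of the state is only ever *added to*, the inputs can be recovered exactly from the outputs, so the backward pass can rematerialise activations instead of storing them.

```python
import numpy as np

def F(x): return np.tanh(x * 1.7)        # placeholder sub-network
def G(x): return np.tanh(x * 0.3 + 1.0)  # placeholder sub-network

def forward(x1, x2):
    # Feistel structure: additive coupling keeps the block invertible
    # regardless of what F and G compute.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Recover the inputs (the would-be stored activations) from outputs alone.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # exact up to float error
```

The trade is exactly as the bullet says: F and G run again on the backward pass (extra compute) so their activations never occupy HBM (saved memory).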
Notable quotes
If you do not batch many users together, the cost can be a thousand times worse. Batch size is the single biggest lever in inference economics.
There are no dark GPUs — but there is a memory wall. Hyperscalers are spending half their CapEx on memory this year.
API pricing actually leaks information about the architecture. The 50% jump at 200k context tells you exactly where memory time crosses compute time.
Each model should generate the sum of human knowledge on its output — because cost-equilibrium says inference tokens equal pretrain tokens.
The reason scale-up size matters isn't memory capacity — it's bandwidth. The bandwidth lets you do longer context, which is what makes models agentic.
I don't see a good path to solving the memory wall. The empirical result is context lengths haven't moved in two years.
Themes
- Memory bandwidth, not compute, is the deepest bottleneck
- Roofline analysis explains every observable AI pricing fact
- Scale-up domain size (NVL72 → Rubin) is the real GPU-generation unlock
- Models are ~100x overtrained vs Chinchilla because inference dominates lifetime cost
- 200k context ceiling caps the long-horizon agent thesis until HBM scales
Mentioned
Ideas
- Roofline analysis (memory time vs compute time)
- Optimal batch ≈ 300 × sparsity
- 20ms HBM drain time = the train schedule
- Mixture-of-experts forces single-rack residency
- Scale-up vs scale-out 8x bandwidth gap
- Pipelining doesn't help KV cache
- Memory wall as the deepest bottleneck
- Gemini 200k context inflection point
- Output 5x input price = decode memory-bound
- Cache-hit pricing reveals memory tier
- Models ~100x overtrained vs Chinchilla
- Cost-equilibrium across pretrain/RL/inference
- Inference token volume ≈ pretrain token volume per model lifetime
- Sub-linear sparsity quality scaling
- Sparse attention sqrt scaling on KV cache
- Context-length ceiling until HBM scales
- RevNets / Feistel cipher activation rematerialisation
- Scale-up domain size as the real GPU-gen unlock