How GPT-5, Claude, and Gemini are actually trained and served — Reiner Pope
Blackboard lecture from Reiner Pope (CEO of MatX, ex-Google TPU architecture) — the technical *underpinning* for why everything in the 20VC + All-In episodes is true. Roofline analysis (memory bandwidth vs compute) explains: why the optimal inference batch is ~300 × sparsity (~2-3k tokens), why coding-agent 'fast modes' charge 6x for 2.5x speed, why Gemini's pricing jumps 50% at 200k context, why output tokens cost 5x input tokens, why frontier models effectively run inside a single rack, why scale-up domain size (NVL72 → Rubin 500+) is the actual GPU-generation unlock rather than raw FLOPS, and why **memory bandwidth — not compute, not even power — is the deepest bottleneck.** Per Dylan Patel (cited mid-episode), ~50% of 2026 hyperscaler CapEx is going on memory. Models are ~100x overtrained vs Chinchilla because inference token volume across a model's ~2-month life exceeds its training tokens.
Key points
- **The two-equation roofline:** inference time = max(compute_time, memory_time), where compute_time = batch × active_params / FLOPS and memory_time = (total_params + batch × context × bytes_per_token) / mem_bandwidth. Below the crossover batch you're memory-bound; above it you're compute-bound. Every observable AI-pricing fact drops out of this single framework (see the sketch after this list).
- **Optimal batch ≈ 300 × sparsity (~2-3k tokens).** For DeepSeek (32 of 256 experts active, sparsity factor 8): 8 × 300 = 2,400. The 'train' analogy: batches depart every ~20ms (the HBM drain time), and late passengers wait for the next train, so worst-case latency is 2 × 20ms = 40ms — a full wait plus a full ride. To be 'competitive at scale' you need ~128k tokens/sec served = 1/1000 of Gemini.
- **Why coding-agent fast-mode is 6x price for 2.5x speed:** smaller batches mean the weight-fetch cost isn't amortised over enough tokens. The cost floor sits at the batch where compute time equals memory time; below that batch, cost per token climbs steeply no matter what you charge. Conversely, there's no point offering a 'slow mode' beyond the crossover — once compute-bound, a bigger batch saves nothing (the sketch below makes this floor concrete).
- **Mixture-of-experts forces single-rack residency.** All-to-all communication pattern; scale-out fabric is 8x slower than NVLink scale-up. Half the traffic crossing rack boundaries kills throughput. This is *the* reason NVL72 (and Rubin's 500+ scale-up) matters more than per-GPU FLOPS gains.
- **Pipelining works for >1 rack but doesn't help KV cache.** Pipelining spreads the weights across racks, but each added stage needs more sequences in flight to keep the racks busy, so the aggregate KV cache grows right back. Translation: you can (eventually) scale parameters across racks; you cannot scale long context the same way. Context length is HBM-bandwidth-bound, and that's the deepest wall.
- **Gemini 3.1's 50% price jump at 200k context** = the empirical inflection point where memory time crosses compute time. Reverse-engineered from API pricing, it implies ~2KB of KV cache per token — plausible with 8 KV heads and d_head = 128. **API pricing leaks architecture** (a back-of-envelope version follows this list).
- **Output is 5x more expensive than input** because prefill is compute-bound (parallel, large effective batch) while decode is memory-bandwidth-bound (one token at a time, fetching the whole KV cache at every step). A direct read on how memory-constrained each model is.
- **Cache-hit pricing reveals the memory tier:** a 5-minute window discount ≈ DDR drain time; a 1-hour window ≈ flash or even spinning disk. Frontier providers are literally caching contexts to spinning rust, because the economics work at the right hold time.
- **Models are ~100x overtrained vs Chinchilla.** Reiner equates cost(pretrain) ≈ cost(RL) ≈ cost(inference). Working backwards from ~50M tokens/sec served over a ~2-month lifetime and ~100B active params: pretrain lands at ~150-200T tokens, where Chinchilla-optimal would be ~2T (arithmetic worked through below). **Inference >> training in lifetime cost** — the entire 'sum of human knowledge' in tokens gets re-emitted by every served model. This is the real reason for the serving-cost obsession.
- **Sparsity has sub-linear quality scaling.** 4x more experts buys ~4x effective parameters in quality terms (per the unified routed-LM scaling laws), and quality itself grows sub-linearly in those parameters. Profitable only if the memory capacity exists to host the inactive experts — which is why DeepSeek-style fine-grained sparse MoE wins.
- **Why context length stalled at ~200k for two years:** there is no solution to the HBM memory wall. Sparse attention's sqrt-scaling helps but eventually loses quality. The 'continual learning isn't needed if context is long enough' thesis (Dario) requires 100M-token contexts that current memory tech cannot support. **A real ceiling on the agent-as-employee narrative until HBM scales materially or attention becomes fundamentally sparser.**
- **RevNets / Feistel-cipher trick:** import the invertibility construction from cryptography to skip storing activations during training — rematerialise them on the backward pass (minimal sketch below). Trades compute for memory, and tells you how desperate the field is for memory headroom.
- **Scale-up domain size is the GPU-generation story.** Hopper 8 → Blackwell 72 → Rubin 500+. The headline FLOPS gains matter less than the aggregate memory bandwidth available in parallel for weight loads, which scales with the scale-up domain. Explains why Gemini was ahead on long context for a year: Google's TPU pods had bigger scale-up domains earlier.
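A minimal sketch of the two-equation roofline, in Python. Every hardware and model number below is an illustrative assumption (DeepSeek-like sparsity, an aggregate ops:byte ratio near 300, fp8 weights), not a figure from the lecture — the point is the shape of the curve, not the constants.

```python
# Toy roofline for MoE decode. All numbers are assumptions for illustration.
TOTAL_PARAMS       = 671e9             # DeepSeek-scale total parameters (assumed)
ACTIVE_PARAMS      = TOTAL_PARAMS / 8  # 32 of 256 experts -> sparsity factor 8
FLOPS              = 8e15              # aggregate FLOP/s of the serving domain (assumed)
MEM_BW             = 26.4e12           # aggregate HBM bandwidth, bytes/s (assumed)
BYTES_PER_PARAM    = 1                 # fp8 weights (assumed)
KV_BYTES_PER_TOKEN = 2048              # the ~2KB/token figure from the pricing bullet
CONTEXT            = 8192              # tokens of context per sequence (assumed)

def decode_step_time(batch):
    """inference time = max(compute_time, memory_time), per the roofline."""
    compute = batch * ACTIVE_PARAMS / FLOPS
    memory = (TOTAL_PARAMS * BYTES_PER_PARAM
              + batch * CONTEXT * KV_BYTES_PER_TOKEN) / MEM_BW
    return max(compute, memory), compute, memory

# Crossover batch, ignoring the KV term: where compute time = weight-load time.
# This reduces to sparsity * (FLOPS / MEM_BW) * bytes_per_param ~ 8 * 300 = 2,400.
crossover = (TOTAL_PARAMS * BYTES_PER_PARAM / MEM_BW) / (ACTIVE_PARAMS / FLOPS)
print(f"crossover batch ~ {crossover:,.0f} tokens")

for batch in (1, 64, 2_400, 8_192):
    t, c, m = decode_step_time(batch)
    regime = "memory-bound" if m >= c else "compute-bound"
    print(f"batch {batch:>5}: {t*1e3:5.1f} ms/step, "
          f"{t/batch*1e6:8.1f} us/token  ({regime})")
```

With these assumptions the weight-load time comes out at ~25ms per step (the ~20ms 'train departure'), batch 1 costs a few thousand times more per token than the crossover batch (the 'thousand times worse' quote), and pushing past the crossover into the compute-bound regime barely moves cost per token — the fast-mode/slow-mode floor from the bullets above.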
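The 200k inflection and the 'API pricing leaks architecture' claim can be sanity-checked with the same roofline terms. The batch size and byte counts below are assumptions carried over from the sketch above; the point is the method, not reproducing the lecture's exact ~2KB.

```python
# Where does the KV term in memory_time overtake the weight term?
TOTAL_PARAM_BYTES  = 671e9   # fp8 weights (assumed, as above)
KV_BYTES_PER_TOKEN = 2048    # ~2KB/token, the reverse-engineered figure
BATCH              = 2400    # serving near the crossover batch (assumed)

# memory_time ~ TOTAL_PARAM_BYTES + BATCH * context * KV_BYTES_PER_TOKEN,
# so KV traffic matches weight traffic at:
ctx_star = TOTAL_PARAM_BYTES / (BATCH * KV_BYTES_PER_TOKEN)
print(f"KV = weights at ~{ctx_star/1e3:.0f}k context")  # ~137k with these assumptions

# Reverse direction ("pricing leaks architecture"): observe the inflection
# in the price sheet, solve for bytes of KV cache per token.
observed_inflection = 200_000
kv_inferred = TOTAL_PARAM_BYTES / (BATCH * observed_inflection)
print(f"implied KV cache ~ {kv_inferred:.0f} bytes/token")  # ~1.4KB here
```

Under these assumed numbers the implied figure lands in the same ballpark as the ~2KB/token quoted in the bullet — which is the whole trick: a public price break plus a roofline gives you a hardware-level architecture estimate.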
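The overtraining arithmetic, worked through with the episode's round numbers (everything here is order-of-magnitude):

```python
# Lifetime inference tokens vs Chinchilla-optimal pretraining tokens.
tokens_per_sec = 50e6            # ~50M tokens/sec served (from the episode)
lifetime_sec   = 60 * 24 * 3600  # ~2-month frontier-model lifetime
inference_tokens = tokens_per_sec * lifetime_sec
print(f"lifetime inference tokens ~ {inference_tokens/1e12:.0f}T")  # ~260T

active_params = 100e9
chinchilla_tokens = 20 * active_params  # ~20 tokens/param rule of thumb
print(f"Chinchilla-optimal ~ {chinchilla_tokens/1e12:.0f}T tokens")  # ~2T

# Cost-equilibrium (pretrain ~ inference) pushes pretraining toward the
# inference-token scale: ~150-200T in the lecture.
print(f"overtraining factor ~ {175e12 / chinchilla_tokens:.0f}x")    # ~90x, i.e. ~100x
```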
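The RevNet / Feistel coupling is simple enough to show directly. `F` and `G` below are hypothetical stand-ins for arbitrary (non-invertible) sub-networks: because each half of the state is only ever *added to*, the inputs can be recovered exactly from the outputs, so the backward pass can rematerialise activations instead of storing them.

```python
import numpy as np

def F(x): return np.tanh(x * 1.7)        # placeholder sub-network
def G(x): return np.tanh(x * 0.3 + 1.0)  # placeholder sub-network

def forward(x1, x2):
    # Feistel structure: additive coupling keeps the block invertible
    # regardless of what F and G compute.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Recover the inputs (the would-be stored activations) from outputs alone.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # exact up to float error
```

The trade is exactly as the bullet says: F and G run again on the backward pass (extra compute) so their activations never occupy HBM (saved memory).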
Notable quotes
If you do not batch many users together, the cost can be a thousand times worse. Batch size is the single biggest lever in inference economics.
There are no dark GPUs — but there is a memory wall. Hyperscalers are spending half their CapEx on memory this year.
API pricing actually leaks information about the architecture. The 50% jump at 200k context tells you exactly where memory time crosses compute time.
Each model should generate the sum of human knowledge on its output — because cost-equilibrium says inference tokens equal pretrain tokens.
The reason scale-up size matters isn't memory capacity — it's bandwidth. The bandwidth lets you do longer context, which is what makes models agentic.
I don't see a good path to solving the memory wall. The empirical result is context lengths haven't moved in two years.
Themes
- Memory bandwidth, not compute, is the deepest bottleneck
- Roofline analysis explains every observable AI pricing fact
- Scale-up domain size (NVL72 → Rubin) is the real GPU-generation unlock
- Models are ~100x overtrained vs Chinchilla because inference dominates lifetime cost
- 200k context ceiling caps the long-horizon agent thesis until HBM scales
Mentioned
Ideas
- Roofline analysis (memory time vs compute time)
- Optimal batch ≈ 300 × sparsity
- 20ms HBM drain time = the train schedule
- Mixture-of-experts forces single-rack residency
- Scale-up vs scale-out 8x bandwidth gap
- Pipelining doesn't help KV cache
- Memory wall as the deepest bottleneck
- Gemini 200k context inflection point
- Output 5x input price = decode memory-bound
- Cache-hit pricing reveals memory tier
- Models ~100x overtrained vs Chinchilla
- Cost-equilibrium across pretrain/RL/inference
- Inference token volume ≈ pretrain token volume per model lifetime
- Sub-linear sparsity quality scaling
- Sparse attention sqrt scaling on KV cache
- Context-length ceiling until HBM scales
- RevNets / Feistel cipher activation rematerialisation
- Scale-up domain size as the real GPU-gen unlock