Dwarkesh Podcast

Chip design from the bottom up – Reiner Pope

1h 20m

Dwarkesh's second sit-down with **Reiner Pope** (MatX CEO, ex-Google TPU). **This is a teaching episode, not a market one**: Reiner walks the entire chip-design stack from logic gates → multiply-accumulate units → multiplexers → CUDA cores → tensor cores → systolic arrays → CPU/GPU/TPU/FPGA tradeoffs. But the strategic implications for the [Issue 04 memory thesis](/issues/2026-05-10) and the [Issue 05 Cerebras IPO](/issues/2026-05-17) are sharp. Key takeaways:

**(1)** Multiply-accumulate as the AI primitive: area scales **quadratically** with bit precision. Nvidia's **B300 acknowledged this with FP4 = 3x FP8 (should be 4x)**; B100/B200 incorrectly used the 2x ratio.

**(2) TPU vs GPU at the architectural level.** TPU = few large matrix units (better amortisation of register file costs, larger systolic arrays); GPU = many small SMs (more flexible, higher data-movement bandwidth between vector and matrix units). 'A GPU is essentially a lot of tiny TPUs tiled across the whole chip.'

**(3) MatX disclosure**: 'splittable systolic array', meaning large arrays that can also act as small ones. The architectural bet sits between TPU (too coarse) and GPU (too fragmented).

**(4) Scratchpad vs cache** as the cleanest determinism-vs-flexibility tradeoff: TPUs and Groq use scratchpad (deterministic latency); CPUs use caches (variable, faster average).

**(5) FPGA = 10x more expensive than ASIC** because LUTs synthesise gates from 16 storage bits when an ASIC just lays down polysilicon directly.

**(6) Why CPU cores are huge:** the branch predictor. Strip it out + tighten register files = the GPU's lead over CPU.

**The implicit MatX strategic positioning is the most important takeaway:** between Cerebras (wafer-scale extreme) and Nvidia (small-SM GPU), MatX is betting splittable systolic arrays are the right granularity.

Key points

**Multiply-accumulate as the AI primitive: area scales quadratically with bit precision.** Reiner walks through full-adder construction of a 4-bit multiply + 8-bit accumulate using p×q AND gates and p×q full adders, where p and q are the bit widths of the two operands. **The quadratic scaling is the entire reason low-precision arithmetic has worked so well for neural nets**: halving both operand widths cuts the multiplier to roughly a quarter of the area. Critical Nvidia disclosure: **B300 acknowledged the quadratic ratio with FP4 = 3x FP8 (should be 4x)**; pre-B300 chips used the wrong 2x ratio. **Implication:** the FP4-vs-FP8 area math now shows up on Nvidia spec sheets, which means every analyst valuing [Nvidia's $5T market cap, as discussed by Brad in Issue 04](/issues/2026-05-10), needs to update their FLOP/$ assumptions.
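
The p×q rule of thumb above can be sketched in a few lines. This is a minimal illustration, not the construction drawn in the episode: `mac_gate_counts` and `mac` are hypothetical helper names, each one-bit AND term stands in for one AND gate, and treating the full format width (4 or 8 bits) as the multiplier width is a simplification, since real floating-point multipliers only multiply the mantissa bits.

```python
def mac_gate_counts(p: int, q: int) -> dict:
    """Gate counts for a p-bit by q-bit multiply-accumulate, per the p*q rule of thumb."""
    return {"and_gates": p * q, "full_adders": p * q}


def mac(a: int, b: int, acc: int, p: int, q: int) -> tuple[int, int]:
    """Compute acc + a*b by summing one-bit partial products.

    Returns (result, number_of_and_terms). Each (a_i AND b_j) term models one AND gate.
    """
    assert 0 <= a < (1 << p) and 0 <= b < (1 << q)
    total = acc
    and_terms = 0
    for i in range(p):
        for j in range(q):
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate
            total += bit << (i + j)  # in hardware, a full adder absorbs this bit
            and_terms += 1
    return total, and_terms


if __name__ == "__main__":
    print(mac(a=11, b=13, acc=5, p=4, q=4))  # (148, 16)
    ratio = mac_gate_counts(8, 8)["full_adders"] / mac_gate_counts(4, 4)["full_adders"]
    print(ratio)  # 4.0
```

Under these assumptions an 8×8 multiplier needs four times the gates of a 4×4 one, which is the "should be 4x" figure quoted above.
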
**'A GPU is essentially a lot of tiny TPUs tiled across the whole chip.'** Reiner's most elegant architectural framing. **TPU**: few large matrix units (e.g., one big MXU with vector unit in the middle) — better amortisation of register-file costs, can run larger systolic arrays. Downside: huge data movement through only 2 lines of perimeter. **GPU**: ~100 nearly-identical SMs with L2 memory in the middle — more flexible, can move data through 16+ lines of wiring per tensor core, but constrained to small units of everything. **The trade-off is workload-dependent**: huge matmuls favour TPU; varied workloads with high inter-unit communication favour GPU. **Direct cross-reference to [Krishna Rao's three-platform compute strategy from Issue 05](/issues/2026-05-17)** — Anthropic is the only lab using all three (Nvidia/TPU/Trainium) precisely because workload mix is the bet.
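
The few-big-units versus many-small-units trade can be made concrete with a toy model. The sketch below assumes a fixed budget of multiply-accumulate cells split into square systolic tiles, each fed along two of its edges; the function name, the budget, and the tile sizes are illustrative choices, not figures from the episode or from any real chip.

```python
def edge_lanes_per_mac(total_macs: int, tile_side: int) -> float:
    """Toy model: data lanes available per MAC when a budget is cut into square tiles.

    Each tile holds tile_side**2 MACs and is fed along two edges (2 * tile_side lanes).
    """
    macs_per_tile = tile_side * tile_side
    if total_macs % macs_per_tile != 0:
        raise ValueError("tile size must divide the MAC budget evenly")
    tiles = total_macs // macs_per_tile
    total_lanes = tiles * 2 * tile_side
    return total_lanes / total_macs


if __name__ == "__main__":
    budget = 256 * 256  # hypothetical MAC budget
    for side in (256, 64, 16):
        # One 256-wide array versus many 64-wide or 16-wide arrays.
        print(side, edge_lanes_per_mac(budget, side))
```

In this model the lanes-per-MAC figure is simply 2 divided by the tile side, so smaller tiles buy more data-movement bandwidth per unit of compute, while larger tiles spread fixed per-tile costs (such as register files) over more MACs.
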
**MatX's 'splittable systolic array' as the architectural middle ground.** Reiner discloses MatX's public design philosophy: 'big systolic arrays that can be small systolic arrays too.' Reads as: **between Cerebras (wafer-scale = single 46,000 sq mm chip) and Nvidia (a GPU built from many small SMs)**, MatX is betting the right unit of compute is dynamic, with large arrays for big matmuls and small splits for variable workloads. **Implicit competitive positioning:** Cerebras wins on raw inference throughput (the [Andrew Feldman thesis from this week](/issues/2026-05-24)); MatX bets workload flexibility wins on customer fit. The two AI-chip startups have made opposite architectural bets, and both can be right for different customers.
**Scratchpad vs cache as the cleanest determinism-flexibility trade.** **TPUs + Groq use scratchpad**: software explicitly decides what to load from HBM vs scratchpad (different instructions for each). Result: **deterministic latency.** **CPUs use caches**: hardware decides on cache hits/misses based on ambient environment, branch behaviour, RNG. Result: variable latency, but faster average. **CPU caches are the single biggest source of non-determinism** — and the reason high-frequency trading firms like Jane Street prefer FPGAs over CPUs. **The deterministic-latency property is a real product differentiator** for inference workloads where SLA tail-latency matters more than throughput average.
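
The latency contrast described above can be sketched with a toy simulation. Everything here is an assumption made for illustration: the 1-cycle and 100-cycle costs, the direct-mapped cache organisation, and a scratchpad large enough to hold the whole working set. It shows only the variance side of the trade, not the cache's better average on friendly workloads.

```python
import random
import statistics

HIT_CYCLES = 1     # assumed cost of an on-chip access
MISS_CYCLES = 100  # assumed cost of going out to HBM/DRAM


def cache_latencies(addresses, num_lines=4):
    """Direct-mapped cache: hardware decides hit or miss from the access history."""
    lines = [None] * num_lines
    latencies = []
    for addr in addresses:
        slot = addr % num_lines
        if lines[slot] == addr:
            latencies.append(HIT_CYCLES)
        else:
            lines[slot] = addr  # evict whatever was there
            latencies.append(MISS_CYCLES)
    return latencies


def scratchpad_latencies(addresses):
    """Scratchpad: software preloads the working set, then every access costs the same."""
    working_set = set(addresses)
    preload_cycles = MISS_CYCLES * len(working_set)  # paid up front, explicitly
    return preload_cycles, [HIT_CYCLES] * len(addresses)


if __name__ == "__main__":
    rng = random.Random(0)
    trace = [rng.randrange(16) for _ in range(1000)]
    cached = cache_latencies(trace)
    preload, scratch = scratchpad_latencies(trace)
    print("cache stdev:", statistics.pstdev(cached))
    print("scratchpad stdev:", statistics.pstdev(scratch), "after preload of", preload)
```

The scratchpad pays its load cost once, at a point the software chooses, so per-access latency has zero spread; the cache's per-access latency depends on which addresses happened to come before.
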
**FPGA vs ASIC = 10x cost differential.** ASIC = polysilicon laid down directly for the desired gates. FPGA = LUTs (lookup tables) synthesise gates from 16 storage bits per gate equivalent. Cost: 'A 4-way AND gate costs 3 gates in an ASIC and 32 gates in an FPGA.' **First-FPGA cost $10K vs first-ASIC cost $30M (tape-out).** Reiner's business framing: FPGA wins when you change workloads every month and don't want to pay tape-out. **Strategic implication for the AI-chip cohort:** the moat of Cerebras, MatX, Groq, Tenstorrent, and others is *not* the chip — it's the willingness/capital/risk-tolerance to keep paying $30M tape-outs for each architectural generation. **First-time AI-chip founders underestimate this by an order of magnitude.**
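
The "16 storage bits" figure is what a 4-input lookup table needs: one stored bit per input combination. A minimal sketch, with `make_lut4` and `lut4_eval` as hypothetical helper names rather than any vendor's API:

```python
def make_lut4(fn) -> int:
    """Build the 16-bit truth table for any 4-input boolean function."""
    table = 0
    for index in range(16):
        a, b, c, d = (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1
        if fn(a, b, c, d):
            table |= 1 << index
    return table


def lut4_eval(table: int, a: int, b: int, c: int, d: int) -> int:
    """Evaluate the LUT: the four inputs select one of the 16 stored bits."""
    index = (a << 3) | (b << 2) | (c << 1) | d
    return (table >> index) & 1


# The same 16 bits of storage can be programmed as a 4-way AND or a 4-way XOR.
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)

if __name__ == "__main__":
    print(hex(and4), lut4_eval(and4, 1, 1, 1, 1), lut4_eval(and4, 1, 1, 0, 1))
    print(hex(xor4), lut4_eval(xor4, 1, 0, 0, 0))
```

The flexibility is the cost: the LUT spends 16 storage cells plus a selector on a function that a fixed ASIC implements with a handful of gates.
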
**The branch predictor as the CPU-vs-GPU difference.** Why are there so many more CUDA cores than CPU cores? *'Inside the CPU, one big use of the area is the cache. Mostly it's the register files rather than the logic units. Both have equivalents in a GPU. But the thing that does not have an equivalent in a GPU is the branch predictor, a whole big area of the CPU that predicts where the next branch is.'* The branch predictor exists because a CPU clock cycle (1 ns or less at 1-2 GHz) is much shorter than the time needed to evaluate a branch (about 5 ns, which would cap the clock near 200 MHz), so the core guesses the outcome instead of waiting for it. **Strip out branch prediction + tighten register files = the entire GPU lead over CPU on parallel workloads.** This is the most concise explanation of why CPUs lost AI to GPUs in any episode this year.
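
For a feel of what "predicting the next branch" means, here is the classic textbook 2-bit saturating-counter predictor. It is a swapped-in illustration, not something described in the episode, and real CPU predictors are far larger and more elaborate, which is exactly the area cost being discussed.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self) -> None:
        self.state = 2  # start at "weakly taken" (arbitrary choice)

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes) -> float:
    """Fraction of branches predicted correctly over a trace of taken/not-taken outcomes."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    loop_branch = ([True] * 9 + [False]) * 10  # a loop that runs 10 iterations, 10 times
    print(accuracy(loop_branch))  # 0.9
```

Only the loop-exit branch is mispredicted in this trace. A GPU avoids needing this machinery at all by keeping many threads in flight instead of guessing.
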
**Pipeline-register insertion as the area-vs-clock-speed lever.** Splitting a logic cloud in half with a register roughly doubles clock speed but also doubles register area. **The hardest case: feedback loops** (e.g., running sums) where you can't insert a register without changing the computation. *'This constraint, where I have a loop in my logic, which all chips have somewhere, is the hardest thing to address and sets the clock cycle.'* **Strategic implication:** 'chips made at the same TSMC 3nm process node can have different clock speeds based on how well they optimised critical paths.' This is why **TSMC + chip-design talent are *both* moats**: same process node, different chip-design teams, different clock speeds and performance.
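
The pipelining lever and the feedback-loop limit can be sketched as a small timing model. The delays and the register overhead below are made-up numbers, and the function names are hypothetical; the model assumes the feed-forward logic can be cut into perfectly even stages, which real designs rarely achieve.

```python
def clock_period_ns(stage_delays_ns, register_overhead_ns=0.05):
    """The clock period is set by the slowest stage plus per-register overhead."""
    return max(stage_delays_ns) + register_overhead_ns


def split_evenly(total_delay_ns, stages):
    """Idealised pipelining: cut a logic cloud into equal-delay stages."""
    return [total_delay_ns / stages] * stages


def min_period_with_loop(feedforward_delay_ns, loop_delay_ns, stages, register_overhead_ns=0.05):
    """Feed-forward logic can be split further; a feedback loop (e.g. a running sum) cannot."""
    pipelined = clock_period_ns(split_evenly(feedforward_delay_ns, stages), register_overhead_ns)
    loop_bound = loop_delay_ns + register_overhead_ns
    return max(pipelined, loop_bound)


if __name__ == "__main__":
    for stages in (1, 2, 4, 8):
        period = min_period_with_loop(2.0, 0.5, stages)
        print(stages, "stages:", round(1000 / period), "MHz")
```

Adding stages helps until the pipelined delay drops below the loop delay; past that point the loop sets the clock, which matches the quoted claim that the feedback path is the hardest constraint.
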
**On clock speed and energy efficiency.** Reiner's most useful intuition for the [Issue 04 power-bottleneck thesis](/issues/2026-05-10): 'The faster the clock cycle, the bigger the voltage needs to be in order for the signal to settle. Most energy consumption comes from toggling bits 0→1→0. If you run a chip 1000x slower, you have 1000x fewer transitions = ~1000x less energy.' **But it's not a substantial efficiency advantage** because most of the time the chip sits idle anyway. **The brain runs at 'megahertz' not 'gigahertz': Reiner's framing for why brain-vs-silicon energy comparisons aren't apples to apples.**
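
The toggling argument corresponds to the standard first-order CMOS switching-power estimate, P = α·C·V²·f. The sketch below plugs in made-up values for activity factor, capacitance, and voltage purely to show the scaling; none of these numbers come from the episode.

```python
def dynamic_power_watts(activity, capacitance_farads, voltage, frequency_hz):
    """First-order CMOS switching power: P = activity * C * V^2 * f."""
    return activity * capacitance_farads * voltage ** 2 * frequency_hz


if __name__ == "__main__":
    # Hypothetical chip: 10% of nodes toggle per cycle, 1 nF switched capacitance.
    fast = dynamic_power_watts(0.1, 1e-9, 1.0, 1e9)  # 1 GHz at 1.0 V
    slow = dynamic_power_watts(0.1, 1e-9, 1.0, 1e6)  # 1 MHz at the same voltage
    slow_low_v = dynamic_power_watts(0.1, 1e-9, 0.5, 1e6)  # 1 MHz at half the voltage
    print(fast / slow)        # power drops in proportion to frequency
    print(slow / slow_low_v)  # halving the voltage gives a further 4x
```

At a fixed voltage, slowing the clock cuts power but not the energy per toggle; the efficiency gain per operation comes from the lower voltage that a slower clock permits, since energy per toggle scales with V².
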
**On co-location of memory and compute.** Dwarkesh asks if the brain's neuron-to-neuron unstructured sparsity + memory-compute co-location is the structural advantage over silicon. Reiner: *'Memory and compute are actually co-located on these dies too: that's exactly what the SM register-file-near-ALU design is. The bigger difference is clock cycle: brain is much slower than silicon to preserve energy.'* **Reads as a soft rebuttal to the 'neuromorphic chips will replace GPUs' framing**: the engineering co-location problem is already solved at small scale; the brain's advantages are scale (10^11 neurons, densely connected) and clock-rate energy budget, not topology.
**The deterministic-latency moat for Groq is now explicit.** *'Groq has advertised deterministic latency. TPUs have it in the core. Most CPU chip designers added non-determinism (caches, branch prediction) to win on average performance. Some chip designers have removed those choices.'* **Direct competitive intelligence**: deterministic latency is now a publicly-positioned product differentiator across Groq, TPU, and presumably MatX. **Implication for inference-cost optimisation**: $/token-with-SLA-tail-latency-bound is a different metric than $/token-average, and the deterministic-latency cohort wins on the former. **Worth tracking against Cerebras's $/token claims from [Andrew Feldman this week](/issues/2026-05-24)** — different optimisation function.

Notable quotes

A GPU is essentially a lot of tiny TPUs tiled across the whole chip.

Reiner Pope · 1:10:00

The single reason low-precision arithmetic has worked so well for neural nets is this quadratic scaling — die area scales as p × q with bit precision.

Reiner Pope · 12:00

Nvidia made a change with B300. FP4 is now 3x faster than FP8. It should be 4x. Pre-B300 chips just used the wrong 2x ratio.

Reiner Pope · 13:40

The first FPGA costs you $10,000. The first ASIC costs $30 million because it requires an entire tape-out. The business case for FPGA is when you change workloads every month.

Reiner Pope · 56:20

We've talked publicly about something we call a splittable systolic array — in some sense, big systolic arrays that can be small systolic arrays too.

Reiner Pope · 1:18:20

The CPU has a whole big area dedicated to the branch predictor that does not have an equivalent in a GPU. Stripping that out, along with tighter register files, drives a lot of the GPU gains over the CPU.

Reiner Pope · 1:08:00

Chips made at the same TSMC 3nm node can have different clock speeds based on whether they optimised critical paths well. There will be manufacturing variance.

Reiner Pope · 46:20

Groq has advertised deterministic latency. TPUs have it in the core. Some chip designers added non-determinism to win on average performance — others removed it.

Reiner Pope · 1:06:00

Themes

Multiply-accumulate area scales quadratically with bit precision (B300 acknowledges)
TPU vs GPU = few-big-units vs many-small-units architectural philosophy
MatX splittable systolic array as middle-ground between Cerebras wafer-scale and Nvidia small-SM
Scratchpad vs cache = deterministic vs variable latency as inference SLA differentiator
TSMC node + chip-design talent both moats; same 3nm yields different chip clock speeds
