Chip design from the bottom up – Reiner Pope
Dwarkesh's second sit-down with **Reiner Pope** (MatX CEO, ex-Google TPU). **This is a teaching episode, not a market one** — Reiner walks the entire chip-design stack from logic gates → multiply-accumulate units → multiplexers → CUDA cores → tensor cores → systolic arrays → CPU/GPU/TPU/FPGA tradeoffs. But the strategic implications for the [Issue 04 memory thesis](/issues/2026-05-10) and the [Issue 05 Cerebras IPO](/issues/2026-05-17) are sharp. Key takeaways: **(1)** Multiply-accumulate as the AI primitive — area scales **quadratically** with bit precision. Nvidia's **B300 acknowledged this with FP4 = 3x FP8 (should be 4x)**; B100/B200 incorrectly used the 2x ratio. **(2) TPU vs GPU at the architectural level** — TPU = few large matrix units (better amortisation of register file costs, larger systolic arrays); GPU = many small SMs (more flexible, higher data-movement bandwidth between vector and matrix units). 'A GPU is essentially a lot of tiny TPUs tiled across the whole chip.' **(3) MatX disclosure**: 'splittable systolic array' — large arrays that can also act as small ones. The architectural bet sits between TPU (too coarse) and GPU (too fragmented). **(4) Scratchpad vs cache** as the cleanest determinism-vs-flexibility tradeoff — TPUs and Groq use scratchpad (deterministic latency); CPUs use caches (variable, faster average). **(5) FPGA = 10x more expensive than ASIC** because LUTs synthesise gates from 16 storage bits when an ASIC just lays down polysilicon directly. **(6) Why CPU cores are huge:** the branch predictor. Strip it out + tighten register files = the GPU's lead over CPU. **The implicit MatX strategic positioning is the most important takeaway:** between Cerebras (wafer-scale extreme) and Nvidia (small-SM GPU), MatX is betting splittable systolic arrays are the right granularity.
Key points
- **Multiply-accumulate as the AI primitive: area scales quadratically with bit precision.** Reiner walks through the full-adder construction of a 4-bit multiply + 8-bit accumulate: multiplying a p-bit operand by a q-bit operand takes p×q AND gates and p×q full adders. **The quadratic scaling is the entire reason low-precision arithmetic has worked so well for neural nets.** Critical Nvidia disclosure: **B300 acknowledged the quadratic ratio by making FP4 throughput 3x FP8 (the area math says it should be 4x)**; pre-B300 chips used the wrong 2x ratio. **Implication:** the FP4-vs-FP8 die-area math is now visible on Nvidia spec sheets, which means every analyst valuing the [Nvidia $5T market cap by Brad in Issue 04](/issues/2026-05-10) needs to update their FLOP/$ assumptions.
- **'A GPU is essentially a lot of tiny TPUs tiled across the whole chip.'** Reiner's most elegant architectural framing. **TPU**: few large matrix units (e.g., one big MXU with vector unit in the middle) — better amortisation of register-file costs, can run larger systolic arrays. Downside: huge data movement through only 2 lines of perimeter. **GPU**: ~100 nearly-identical SMs with L2 memory in the middle — more flexible, can move data through 16+ lines of wiring per tensor core, but constrained to small units of everything. **The trade-off is workload-dependent**: huge matmuls favour TPU; varied workloads with high inter-unit communication favour GPU. **Direct cross-reference to [Krishna Rao's three-platform compute strategy from Issue 05](/issues/2026-05-17)** — Anthropic is the only lab using all three (Nvidia/TPU/Trainium) precisely because workload mix is the bet.
- **MatX's 'splittable systolic array' as the architectural middle ground.** Reiner discloses MatX's public design philosophy: 'big systolic arrays that can be small systolic arrays too.' Reads as: **between Cerebras (wafer-scale = single 46,000 sq mm chip) and Nvidia (GPU built from many small SMs)**, MatX is betting the right unit of compute is dynamic: large arrays for big matmuls, small splits for variable workloads. **Implicit competitive positioning:** Cerebras wins on raw inference throughput (the [Andrew Feldman thesis from this week](/issues/2026-05-24)); MatX bets workload flexibility wins on customer fit. The two AI-chip startups have made opposite architectural bets, and both can be right for different customers.
- **Scratchpad vs cache as the cleanest determinism-flexibility trade.** **TPUs + Groq use scratchpad**: software explicitly decides what to load from HBM vs scratchpad (different instructions for each). Result: **deterministic latency.** **CPUs use caches**: hardware decides on cache hits/misses based on ambient environment, branch behaviour, RNG. Result: variable latency, but faster average. **CPU caches are the single biggest source of non-determinism** — and the reason high-frequency trading firms like Jane Street prefer FPGAs over CPUs. **The deterministic-latency property is a real product differentiator** for inference workloads where SLA tail-latency matters more than throughput average.
- **FPGA vs ASIC = ~10x cost differential per gate.** ASIC = polysilicon laid down directly for the desired gates. FPGA = LUTs (lookup tables) that synthesise each gate equivalent from 16 storage bits. Cost: 'A 4-way AND gate costs 3 gates in an ASIC and 32 gates in an FPGA.' The up-front cost runs the other way: **first-FPGA cost $10K vs first-ASIC cost $30M (tape-out).** Reiner's business framing: FPGA wins when you change workloads every month and don't want to pay tape-out. **Strategic implication for the AI-chip cohort:** the moat of Cerebras, MatX, Groq, Tenstorrent, and others is *not* the chip; it's the willingness/capital/risk-tolerance to keep paying $30M tape-outs for each architectural generation. **First-time AI-chip founders underestimate this by an order of magnitude.**
- **The branch predictor as the CPU-vs-GPU difference.** Why are there so many more CUDA cores than CPU cores? *'Inside the CPU, one big use of the area is the cache. Mostly it's the register files rather than the logic units. Both have equivalents in a GPU. But the thing that does not have an equivalent in a GPU is the branch predictor — a whole big area of the CPU that predicts where the next branch is.'* The branch predictor exists because CPU clock speed (1-2 GHz) is faster than the time to evaluate a branch (5ns at 200 MHz). **Strip out branch prediction + tighten register files = the entire GPU lead over CPU on parallel workloads.** This is the most concise explanation of why CPUs lost AI to GPUs in any episode this year.
- **Pipeline-register insertion as the area-vs-clock-speed lever.** Splitting a logic cloud in half with a register doubles clock speed but doubles register area. **The hardest case: feedback loops** (e.g., running sums) where you can't insert a register without changing the computation. *'This constraint — where I have a loop in my logic, which all chips have somewhere — is the hardest thing to address and sets the clock cycle.'* **Strategic implication:** 'chips made at the same TSMC 3nm process node can have different clock speeds based on how well they optimised critical paths.' This is why **TSMC + chip-design talent are *both* moats** — same process node, different chip-design teams, different yields and performance.
- **On clock speed and energy efficiency.** Reiner's most-useful intuition for the [Issue 04 power-bottleneck thesis](/issues/2026-05-10): 'The faster the clock cycle, the bigger the voltage needs to be in order for the signal to settle. Most energy consumption comes from toggling bits 0→1→0. If you run a chip 1000x slower, you have 1000x fewer transitions = ~1000x less energy.' **But it's not a substantial efficiency advantage** because most of the time the chip sits idle anyway. **The brain runs at 'megahertz' not 'gigahertz', which is Reiner's framing for why brain-vs-silicon energy comparisons aren't apples to apples.**
- **On co-location of memory and compute.** Dwarkesh asks if the brain's neuron-to-neuron unstructured sparsity + memory-compute co-location is the structural advantage over silicon. Reiner: *'Memory and compute are actually co-located on these dies too; that's exactly what the SM register-file-near-ALU design is. The bigger difference is clock cycle: brain is much slower than silicon to preserve energy.'* **Reads as a soft rebuttal to the 'neuromorphic chips will replace GPUs' framing:** the engineering co-location problem is already solved at small scale; the brain's advantages are scale (10^11 neurons, densely interconnected) and clock-rate energy budget, not topology.
- **The deterministic-latency moat for Groq is now explicit.** *'Groq has advertised deterministic latency. TPUs have it in the core. Most CPU chip designers added non-determinism (caches, branch prediction) to win on average performance. Some chip designers have removed those choices.'* **Direct competitive intelligence**: deterministic latency is now a publicly-positioned product differentiator across Groq, TPU, and presumably MatX. **Implication for inference-cost optimisation**: $/token-with-SLA-tail-latency-bound is a different metric than $/token-average, and the deterministic-latency cohort wins on the former. **Worth tracking against Cerebras's $/token claims from [Andrew Feldman this week](/issues/2026-05-24)** — different optimisation function.
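
The gate counts in the multiply-accumulate point above can be sanity-checked with a small sketch. This is an illustration only, not Reiner's construction: the function names are hypothetical, the AND-gate count is derived by enumerating partial products, and the p×q full-adder count is simply taken from the summary rather than derived by the code.

```python
def partial_products(a, b, p, q):
    """One AND gate per (bit of a, bit of b) pair: p*q gates in total.

    Returns (weight, bit) pairs, where weight is the power of two the
    partial-product bit contributes to.
    """
    return [(i + j, ((a >> i) & 1) & ((b >> j) & 1))
            for i in range(p) for j in range(q)]


def mac(a, b, acc, p, q):
    """Multiply-accumulate: sum the weighted partial products into acc."""
    return acc + sum(bit << weight for weight, bit in partial_products(a, b, p, q))


def mac_gate_counts(p, q):
    """Gate counts as stated in the summary (assumption, not derived here)."""
    return {"and_gates": p * q, "full_adders": p * q}


def area_ratio(bits_hi, bits_lo):
    """Relative gate count of a bits_hi x bits_hi MAC vs a bits_lo x bits_lo MAC."""
    hi = sum(mac_gate_counts(bits_hi, bits_hi).values())
    lo = sum(mac_gate_counts(bits_lo, bits_lo).values())
    return hi / lo


if __name__ == "__main__":
    print(len(partial_products(13, 11, 4, 4)))  # number of AND gates for 4x4
    print(mac(13, 11, 100, 4, 4))               # 13 * 11 + 100
    print(area_ratio(8, 4))                     # 8-bit vs 4-bit gate-count ratio
```

Under these assumptions, halving the operand width cuts the gate count by four, which is the basis for the claim that FP4 throughput "should be" 4x FP8 at equal area.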
Notable quotes
A GPU is essentially a lot of tiny TPUs tiled across the whole chip.
The single reason low-precision arithmetic has worked so well for neural nets is this quadratic scaling — die area scales as p × q with bit precision.
Nvidia made a change with B300. FP4 is now 3x faster than FP8. It should be 4x. Pre-B300 chips just used the wrong 2x ratio.
The first FPGA costs you $10,000. The first ASIC costs $30 million because it requires an entire tape-out. The business case for FPGA is when you change workloads every month.
We've talked publicly about something we call a splittable systolic array — in some sense, big systolic arrays that can be small systolic arrays too.
The CPU has a whole big area dedicated to the branch predictor that does not have an equivalent in a GPU. Stripping that out, along with tighter register files, drives a lot of the GPU gains over the CPU.
Chips made at the same TSMC 3nm node can have different clock speeds based on whether they optimised critical paths well. There will be manufacturing variance.
Groq has advertised deterministic latency. TPUs have it in the core. Some chip designers added non-determinism to win on average performance — others removed it.
Themes
- Multiply-accumulate area scales quadratically with bit precision (Nvidia's B300 FP4:FP8 ratio acknowledges this)
- TPU vs GPU = few-big-units vs many-small-units architectural philosophy
- MatX splittable systolic array as middle-ground between Cerebras wafer-scale and Nvidia small-SM
- Scratchpad vs cache = deterministic vs variable latency as inference SLA differentiator
- TSMC node + chip-design talent both moats; the same 3nm node can produce chips with different clock speeds
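
The pipeline-register and feedback-loop points behind the last theme can be illustrated with a toy timing model. It is a rough sketch under stated assumptions: the clock period is taken to be the slowest stage's logic delay plus a fixed register overhead, a split divides a stage's delay exactly in half, and all names and delay values are hypothetical.

```python
def clock_period_ns(stage_delays_ns, register_overhead_ns=0.0):
    """Clock period is set by the slowest stage between two registers."""
    return max(stage_delays_ns) + register_overhead_ns


def split_stage(stage_delays_ns, index):
    """Insert a pipeline register in the middle of one stage.

    Assumes the logic cloud divides into two equal halves, which real
    designs rarely achieve exactly.
    """
    half = stage_delays_ns[index] / 2
    return stage_delays_ns[:index] + [half, half] + stage_delays_ns[index + 1:]


def min_period_with_loop_ns(stage_delays_ns, loop_delay_ns):
    """A feedback loop (e.g. a running sum) cannot be split by a register
    without changing the computation, so its delay is a floor on the period."""
    return max(max(stage_delays_ns), loop_delay_ns)


if __name__ == "__main__":
    stages = [4.0]                      # one unpipelined logic cloud
    pipelined = split_stage(stages, 0)  # two stages, one extra register
    print(clock_period_ns(stages), clock_period_ns(pipelined))
    print(len(stages), len(pipelined))  # stage count, a proxy for register area
    # However finely the feed-forward logic is split, the loop sets the clock.
    print(min_period_with_loop_ns([1.0, 1.0, 1.0, 1.0], loop_delay_ns=3.0))
```

In this model two design teams on the same process node end up with different clock speeds purely from how evenly they balance stages and how short they make the longest feedback loop.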
Mentioned
People
Ideas
- MatX 'splittable systolic array' architecture
- Multiply-accumulate as AI primitive with quadratic area scaling in bit precision
- Nvidia B300: FP4 throughput is 3x FP8 (quadratic area scaling implies 4x)
- TPU = few large matrix units (better register-file amortisation)
- GPU = many small SMs (more flexible, higher inter-unit bandwidth)
- GPU as 'tiny TPUs tiled across whole chip'
- Splittable systolic array as middle ground between TPU and GPU
- Scratchpad (TPU/Groq) vs cache (CPU) — deterministic vs variable latency
- FPGA = ~10x more expensive per gate than ASIC, but far cheaper up front ($10K first FPGA vs $30M tape-out)
- Branch predictor as the CPU-vs-GPU area difference
- Pipeline-register insertion as area-vs-clock-speed lever
- Feedback loops as the hardest constraint setting clock cycle
- Same TSMC node, different clock speeds (chip-design team quality moat)
- Deterministic latency as inference SLA product differentiator (Groq + TPU + MatX)
- Brain runs at megahertz vs silicon gigahertz (energy budget)
- Memory-compute co-location is solved at SM scale; brain advantage is scale + clock-rate energy budget, not co-location or topology
- Dadda multiplier as standard area-efficient summation
- LUT (lookup table) as truth-table programmable gate (FPGA primitive)
- Mux cost = n×p AND + (n-1)×p OR gates
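
The last two ideas, the LUT and the mux cost formula, can be sketched in a few lines. This is a minimal illustration with hypothetical names. It assumes a 4-input LUT (hence 16 storage bits) and a mux driven by one-hot select lines, so the select decoder is not counted in the gate totals.

```python
from itertools import product


def program_lut(fn, k=4):
    """Fill the 2**k storage bits of a k-input LUT with fn's truth table."""
    return [fn(*bits) & 1 for bits in product((0, 1), repeat=k)]


def lut_eval(table, *inputs):
    """Evaluate a LUT: the input bits form an index into the storage bits."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return table[index]


def mux_gate_counts(n, p):
    """Gate cost of an n-way mux over p-bit values with one-hot selects."""
    return {"and_gates": n * p, "or_gates": (n - 1) * p}


def mux(select_onehot, inputs, p):
    """AND every input bit with its select line, then OR the results per bit."""
    out = 0
    for sel, value in zip(select_onehot, inputs):
        mask = (1 << p) - 1 if sel else 0  # select line fanned out to p AND gates
        out |= value & mask
    return out


if __name__ == "__main__":
    and4 = program_lut(lambda a, b, c, d: a & b & c & d)
    print(len(and4))                  # storage bits spent on one 4-way AND
    print(lut_eval(and4, 1, 1, 1, 1), lut_eval(and4, 1, 0, 1, 1))
    print(mux_gate_counts(4, 8))
    print(hex(mux([0, 0, 1, 0], [0x11, 0x22, 0x33, 0x44], 8)))
```

The same `program_lut` call works for any 4-input Boolean function, which is the flexibility an FPGA pays for in storage bits compared with the fixed gates of an ASIC.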