Dwarkesh Podcast

Chip design from the bottom up – Reiner Pope

1h 20m

Dwarkesh's second sit-down with **Reiner Pope** (MatX CEO, ex-Google TPU). **This is a teaching episode, not a market one**: Reiner walks the entire chip-design stack from logic gates → multiply-accumulate units → multiplexers → CUDA cores → tensor cores → systolic arrays → CPU/GPU/TPU/FPGA tradeoffs. Even so, the strategic implications for the [Issue 04 memory thesis](/issues/2026-05-10) and the [Issue 05 Cerebras IPO](/issues/2026-05-17) are sharp. Key takeaways:

**(1) Multiply-accumulate is the AI primitive.** Multiplier area scales **quadratically** with bit precision, so halving the precision should give roughly four times the throughput from the same area. Nvidia's **B300 moved toward this by making FP4 3x faster than FP8 (it should be 4x)**; per Reiner, B100/B200 used the wrong 2x ratio.

**(2) TPU vs GPU at the architectural level.** A TPU has a few large matrix units (better amortisation of register-file costs, larger systolic arrays); a GPU has many small SMs (more flexible, higher data-movement bandwidth between vector and matrix units). 'A GPU is essentially a lot of tiny TPUs tiled across the whole chip.'

**(3) MatX disclosure.** A 'splittable systolic array': large arrays that can also act as small ones. The architectural bet sits between the TPU (too coarse) and the GPU (too fragmented).

**(4) Scratchpad vs cache** is the cleanest determinism-vs-flexibility tradeoff. TPUs and Groq use scratchpads (deterministic latency); CPUs use caches (variable latency, but faster on average).

**(5) FPGAs are 10x more expensive than ASICs** because each LUT synthesises a gate from 16 storage bits, whereas an ASIC just lays down polysilicon directly.

**(6) Why CPU cores are huge:** the branch predictor. Stripping it out and tightening the register files accounts for much of the GPU's lead over the CPU.

**The implicit MatX strategic positioning is the most important takeaway:** between Cerebras (the wafer-scale extreme) and Nvidia (small-SM GPUs), MatX is betting that splittable systolic arrays are the right granularity.

Notable quotes

A GPU is essentially a lot of tiny TPUs tiled across the whole chip.

Reiner Pope · 1:10:00

The single reason low-precision arithmetic has worked so well for neural nets is this quadratic scaling — die area scales as p × q with bit precision.

Reiner Pope · 12:00
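
The p × q scaling in the quote above can be made concrete with a minimal sketch of a schoolbook array multiplier. It assumes that "area" is proportional to the number of one-bit partial-product cells and treats precision as a plain operand bit width; real floating-point formats split their bits between exponent and mantissa, so actual hardware ratios will differ. The function name `array_multiply` is illustrative and not taken from the episode.

```python
def array_multiply(x: int, y: int, p: int, q: int) -> tuple[int, int]:
    """Multiply a p-bit x by a q-bit y the way an array multiplier does.

    Returns (product, cells), where cells counts the one-bit AND
    operations, i.e. the partial-product cells the hardware would need.
    """
    if not (0 <= x < (1 << p) and 0 <= y < (1 << q)):
        raise ValueError("operands must fit in p and q bits")
    product = 0
    cells = 0
    for j in range(q):          # one row per bit of y
        for i in range(p):      # one cell per bit of x
            partial_bit = ((x >> i) & 1) & ((y >> j) & 1)
            cells += 1
            product += partial_bit << (i + j)  # weight of this cell
    return product, cells


if __name__ == "__main__":
    for bits in (4, 8, 16):
        _, cells = array_multiply(1, 1, bits, bits)
        print(f"{bits}-bit x {bits}-bit multiplier: {cells} cells")
```

Under this simplified model, going from 8-bit to 4-bit operands cuts the cell count from 64 to 16, which is the 4x ratio the quote argues FP4 should show over FP8.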

Nvidia made a change with B300. FP4 is now 3x faster than FP8. It should be 4x. Pre-B300 chips just used the wrong 2x ratio.

Reiner Pope · 13:40

The first FPGA costs you $10,000. The first ASIC costs $30 million because it requires an entire tape-out. The business case for FPGA is when you change workloads every month.

Reiner Pope · 56:20
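
The per-gate overhead behind the FPGA cost gap comes from look-up tables: the summary notes that a LUT synthesises a gate from 16 storage bits, which corresponds to a 4-input truth table (2^4 = 16 entries). The sketch below is a toy model of that idea; the `Lut4` class and `program` helper are hypothetical names for illustration, not a vendor API.

```python
class Lut4:
    """A 4-input look-up table: 16 stored bits, one per input combination."""

    def __init__(self, truth_table_bits: int):
        if not 0 <= truth_table_bits < (1 << 16):
            raise ValueError("a 4-input LUT holds exactly 16 bits")
        self.bits = truth_table_bits

    def __call__(self, a: int, b: int, c: int, d: int) -> int:
        # The four inputs select which of the 16 stored bits to output.
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.bits >> index) & 1


def program(fn) -> Lut4:
    """Fill the 16 storage bits so the LUT behaves like fn(a, b, c, d)."""
    bits = 0
    for index in range(16):
        a, b, c, d = (index & 1, (index >> 1) & 1,
                      (index >> 2) & 1, (index >> 3) & 1)
        if fn(a, b, c, d):
            bits |= 1 << index
    return Lut4(bits)


# The same physical structure becomes an AND gate or an XOR gate
# depending only on the stored bits.
and4 = program(lambda a, b, c, d: a & b & c & d)
xor4 = program(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Every gate in this model carries 16 bits of storage plus the selection logic, whether it implements a 4-input AND or something simpler. An ASIC would build only the gate itself, which is the overhead the summary is pointing at.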

We've talked publicly about something we call a splittable systolic array — in some sense, big systolic arrays that can be small systolic arrays too.

Reiner Pope · 1:18:20
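
For readers new to the term, a systolic array is a grid of multiply-accumulate cells that pass operands to their neighbours each cycle. The sketch below simulates a generic textbook output-stationary array in plain Python. It is not MatX's splittable design, whose internals are not described here; the function name and the skewed-input scheme are illustrative assumptions.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    Cell (i, j) keeps a running sum for C[i][j]. Values of A flow left to
    right, values of B flow top to bottom, one hop per cycle.
    Returns (C, cycles).
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    acc = [[0] * cols for _ in range(rows)]
    a_reg = [[0] * cols for _ in range(rows)]   # operand moving rightwards
    b_reg = [[0] * cols for _ in range(rows)]   # operand moving downwards
    cycles = inner + rows + cols - 2            # fill, compute, and drain
    for t in range(cycles):
        for i in range(rows):
            # Shift row i one cell to the right, far edge first.
            for j in range(cols - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                           # row i is delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < inner else 0
        for j in range(cols):
            # Shift column j one cell down, far edge first.
            for i in range(rows - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                           # column j is delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < inner else 0
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]   # one MAC per cell
    return acc, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, cycles)
```

The fill-and-drain term (`rows + cols - 2`) is a fixed overhead per pass. It amortises well on large matrices and poorly on small ones, which is one way to read the appeal of an array that can operate at more than one size.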

The CPU has a whole big area dedicated to the branch predictor that does not have an equivalent in a GPU. Stripping that out, along with tighter register files, drives a lot of the GPU gains over the CPU.

Reiner Pope · 1:08:00

Chips made at the same TSMC 3nm node can have different clock speeds based on whether they optimised critical paths well. There will be manufacturing variance.

Reiner Pope · 46:20

Groq has advertised deterministic latency. TPUs have it in the core. Some chip designers added non-determinism to win on average performance — others removed it.

Reiner Pope · 1:06:00
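
The determinism point can be illustrated with a toy latency model. A scratchpad access always costs the same number of cycles, while a cache access costs a different amount depending on what was touched earlier. The cycle counts and the direct-mapped cache below are hypothetical values chosen for illustration, not measurements of any real chip.

```python
SCRATCHPAD_CYCLES = 4    # hypothetical fixed access cost
CACHE_HIT_CYCLES = 4     # hypothetical cost when the line is present
CACHE_MISS_CYCLES = 40   # hypothetical cost when it must be fetched


def scratchpad_latencies(addresses):
    """Software-managed memory: every access costs the same."""
    return [SCRATCHPAD_CYCLES for _ in addresses]


def cache_latencies(addresses, num_lines=4):
    """Direct-mapped cache: cost depends on the access history."""
    tags = [None] * num_lines
    latencies = []
    for addr in addresses:
        line = addr % num_lines          # which cache line this maps to
        if tags[line] == addr:
            latencies.append(CACHE_HIT_CYCLES)
        else:
            tags[line] = addr            # evict whatever was there
            latencies.append(CACHE_MISS_CYCLES)
    return latencies


if __name__ == "__main__":
    trace = [0, 1, 0, 1, 8, 0]           # address 8 evicts address 0
    print(scratchpad_latencies(trace))
    print(cache_latencies(trace))
```

In this trace, address 0 costs a miss, then a hit, then a miss again after address 8 evicts it, so the same instruction can take different times on different runs of a loop. The scratchpad version never varies, at the price of the programmer or compiler deciding what lives there.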

Mentioned