Overview

In a framework often attributed to investor Gavin Baker, AI inference is no longer a single, homogeneous workload.
It is splitting into two fundamentally different stages — prefill and decode — each dominated by distinct memory constraints.

NVIDIA’s strategic direction, especially when viewed through the lens of a Groq-style SRAM-first decode architecture, suggests something deeper than a typical acquisition: a redefinition of inference hardware itself.

What Gavin Baker Is Actually Saying (Without Jargon)

The thesis reduces to one sentence:

Inference is fragmenting into prefill and decode, and each stage demands a different memory architecture.

This is not about FLOPs.
It is about where data lives, how fast it can be accessed, and at what latency cost.

Inference Is No Longer One Workload

Stage     Dominant Characteristic      Primary Bottleneck
Prefill   Large context ingestion      Memory capacity
Decode    Token-by-token generation    Memory bandwidth & latency

Prefill scales with context window size.
Decode scales with how fast weights and cached activations (the KV cache) can be accessed for each generated token, often one token at a time.

This is why HBM, while exceptional for throughput, is not automatically optimal for decode-heavy or agentic workloads where latency dominates user-perceived performance.
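A rough back-of-the-envelope sketch makes the bandwidth point concrete. Assuming batch-size-1 decode must stream roughly the full weight set plus the KV cache from memory for every generated token, throughput is capped by memory bandwidth rather than FLOPs; the figures below are illustrative assumptions, not vendor specs.

```python
# Illustrative estimate (assumed numbers, not vendor specs): at batch size 1,
# each generated token must stream roughly the full weight set plus the KV
# cache from memory, so decode rate is capped by bandwidth.

def decode_tokens_per_sec(params: float, bytes_per_param: float,
                          kv_cache_bytes: float, mem_bw_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode throughput for a bandwidth-bound model."""
    bytes_per_token = params * bytes_per_param + kv_cache_bytes
    return mem_bw_bytes_per_s / bytes_per_token

# Example: 70B parameters at 1 byte each, ~5 GB KV cache, ~3 TB/s of HBM.
print(decode_tokens_per_sec(70e9, 1.0, 5e9, 3e12))  # ~40 tokens/s per stream
```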

NVIDIA’s Three-Pillar Inference Strategy

Under this worldview, NVIDIA is no longer trying to solve inference with a single GPU design.

Instead, it is assembling a modular inference family, where silicon and memory are paired to specific workloads:

Component                  Memory Type                                      Optimized For
Rubin CPX                  GDDR (large capacity, lower bandwidth)           Massive context prefill
Rubin                      HBM (balanced)                                   Training and batch inference
Groq-derived SRAM decode   SRAM (ultra-high bandwidth, ultra-low latency)   Low-latency decode and agentic reasoning

This is a critical shift.

Inference is no longer monolithic — it is composable.
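One way to picture "composable" inference is a scheduler that places each stage on the memory profile it needs. The device pools, routing rule, and 128K-token threshold below are hypothetical illustrations under this framework, not an NVIDIA product or API.

```python
# Hypothetical sketch of composable inference: prefill and decode land on
# different device pools matched to their memory profile. Pool names and the
# routing thresholds are illustrative assumptions, not an NVIDIA API.

from dataclasses import dataclass

@dataclass
class DevicePool:
    name: str
    memory: str   # memory technology the pool is built around
    role: str     # workload it is optimized for

POOLS = {
    "context-prefill": DevicePool("context-prefill", "GDDR (capacity)", "long-context prefill"),
    "batch-general":   DevicePool("batch-general", "HBM (throughput)", "training / batched inference"),
    "latency-decode":  DevicePool("latency-decode", "SRAM (bandwidth, latency)", "low-latency decode"),
}

def route(stage: str, context_tokens: int) -> DevicePool:
    """Pick a pool per request stage; decode always lands on the SRAM-first pool."""
    if stage == "decode":
        return POOLS["latency-decode"]
    if context_tokens > 128_000:            # arbitrary cutoff for "massive context"
        return POOLS["context-prefill"]
    return POOLS["batch-general"]

print(route("prefill", 200_000).memory)     # GDDR (capacity)
print(route("decode", 200_000).memory)      # SRAM (bandwidth, latency)
```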

Who Is Structurally Disadvantaged

AMD MI300 and Intel Gaudi

Both MI300 and Gaudi are fundamentally HBM-centric designs.

HBM excels at:

  • High throughput
  • Large batch sizes
  • Parallel workloads

But decode increasingly behaves like a workload that is:

  • Small batch
  • Latency-sensitive
  • Memory-access dominated

Once decode is treated as an SRAM-first problem, pure GPU inference faces a structural disadvantage — not because of software, but because of memory philosophy.
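Arithmetic intensity makes the memory-philosophy point concrete. Under the rough assumption of two FLOPs per parameter per token, with weights read from memory once per forward pass and reused across the tokens in that pass, prefill amortizes each weight over thousands of prompt tokens while batch-1 decode does not; the numbers are illustrative.

```python
# Rough arithmetic-intensity comparison (FLOPs per byte of weights moved),
# assuming ~2 FLOPs per parameter per token and 2-byte weights, with weights
# read once per forward pass and reused across the tokens in that pass.

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param

print(arithmetic_intensity(4096))  # prefill of a 4K-token prompt: ~4096 FLOPs/byte
print(arithmetic_intensity(1))     # batch-1 decode: ~1 FLOP/byte, firmly bandwidth-bound
```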

Independent Inference ASIC Startups (Except Groq)

In this framework, Groq is not a platform — it is a decode module.

That distinction is fatal for startups attempting to build “full-stack inference platforms.” Their value lies in a specific memory-latency solution, which can be modularized and absorbed by a larger ecosystem.

Groq’s relevance comes from usefulness, not independence.

Who Can Still Win (By Not Competing at the Same Layer)

Edge and SoC Vendors

Qualcomm, MediaTek, NXP, Infineon, and automotive SoC vendors operate in a different universe:

  • On-device inference
  • Fixed or semi-fixed models
  • Compile-time graphs
  • SRAM + small DRAM

They are not solving hyperscaler inference economics. They are solving power, cost, and determinism.
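At the edge, a crude fit check is often most of the design exercise. The SRAM and DRAM budgets below are placeholder assumptions, not any specific vendor's SoC.

```python
# Crude on-device feasibility check, assuming a fixed compile-time graph with
# int8 weights resident in DRAM and working activations pinned in on-chip SRAM.
# Capacities are placeholder assumptions, not any specific vendor's SoC.

SRAM_BYTES = 32 * 1024**2    # assumed 32 MB on-chip SRAM
DRAM_BYTES = 512 * 1024**2   # assumed 512 MB DRAM budget for weights

def fits_on_device(param_count: int, bytes_per_param: int = 1,
                   peak_activation_bytes: int = 8 * 1024**2) -> bool:
    return (param_count * bytes_per_param <= DRAM_BYTES
            and peak_activation_bytes <= SRAM_BYTES)

print(fits_on_device(300_000_000))    # 300M-parameter int8 model: True
print(fits_on_device(7_000_000_000))  # 7B-parameter model: False for this budget
```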

Broadcom and Custom ASIC Integrators

Hyperscalers increasingly ask:

“I only care about decode. Build me an SRAM-heavy chip.”

That is Broadcom’s domain.

AVGO does not build platforms. It builds exactly one piece of a customer’s inference puzzle.

The Silent Winners: Memory and Packaging

If inference is about memory mix, then the winners include:

  • SK hynix / Micron (HBM supply)
  • TSMC CoWoS / SoIC (advanced packaging bottlenecks)
  • Synopsys / Cadence (SRAM IP and compiler tooling)

NVIDIA wins — but the supply chain wins with it.

Google TPU: A Special Case

Google TPU survives because it does not need to sell silicon.

It only needs to minimize internal inference cost.

TPU can adopt SRAM-heavy decode designs internally, but this does not translate into a competitive external platform.

Who Is Most Exposed

AMD stands out.

MI300 is HBM-centric. There is no SRAM-first decode path, and no modular inference assembly strategy.

This is not an execution issue. It is a memory worldview mismatch.

One-Sentence Conclusion

NVIDIA is not buying Groq — it is buying the hardware definition of decode.

Once inference is understood as:

  • prefill = capacity problem
  • decode = bandwidth and latency problem

whoever controls memory architecture controls inference economics.

This is not another acquisition. It is NVIDIA reclaiming the right to define inference hardware.
