The Groq 3 LPU is now officially part of NVIDIA’s Vera Rubin platform, announced at GTC 2026 as the company’s answer to a fast-moving inference market where cloud giants are building their own custom silicon. Each Groq 3 LPU packs 500 MB of on-chip SRAM, more than double the 230 MB found in LPU v1. In LPX rack configurations, 256 of these processors combine to deliver 128 GB of total on-chip SRAM and 640 TB/s of scale-up memory bandwidth.
TL;DR: At GTC 2026, NVIDIA integrated Groq’s LPU architecture into the Vera Rubin platform. The Groq 3 LPU offers 500 MB of on-chip SRAM per chip, and NVIDIA claims LPX racks deliver up to 35x higher inference throughput per megawatt than GPU-only configurations. This is NVIDIA’s most direct move yet against cloud-native inference challengers.
What is the Groq 3 LPU and why does SRAM matter for AI inference?
The Groq 3 LPU is a Language Processing Unit — a chip purpose-built for AI inference rather than training — that uses on-chip SRAM instead of the external High Bandwidth Memory found in conventional GPUs. On-chip SRAM operates at over 80 terabytes per second, compared to roughly 8 terabytes per second for GPU off-chip HBM. That tenfold bandwidth advantage is not a marketing footnote — it’s the entire architectural argument for LPUs.
Why does this matter? Modern AI inference, especially with trillion-parameter models, is fundamentally a memory-bandwidth problem. For every output token, the model weights have to move from memory into the compute units as fast as possible. GPU architectures fetch that data from external HBM, which adds latency at every step. The LPU keeps weights in on-chip SRAM and so avoids the off-chip fetch, which is how Groq can claim the chip boosts every layer of the AI model on every token.
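The bandwidth argument can be sketched as a back-of-the-envelope ceiling: if every generated token requires streaming the weights once, memory bandwidth divided by weight size bounds the single-stream decode rate. The sketch below uses the 80 TB/s and 8 TB/s figures quoted above. The 70-billion-parameter model at 1 byte per parameter is an assumption chosen for illustration, and the function name is hypothetical. It ignores capacity (a model that size does not fit in one chip’s SRAM), batching, and compute limits, so it shows the ratio rather than achievable speeds.

```python
# Bandwidth-only ceiling on decode speed: each generated token streams the
# model's weights through the compute units once, so bandwidth caps tokens/s.

TB = 1e12  # bytes per terabyte (decimal)


def max_tokens_per_second(bandwidth_bytes_per_s: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode rate when weight reads dominate."""
    return bandwidth_bytes_per_s / weight_bytes


# Assumption (not from the article): 70e9 parameters stored at 1 byte each.
weight_bytes = 70e9 * 1

sram_bound = max_tokens_per_second(80 * TB, weight_bytes)  # on-chip SRAM figure
hbm_bound = max_tokens_per_second(8 * TB, weight_bytes)    # off-chip HBM figure

print(f"SRAM-bound ceiling: {sram_bound:.0f} tokens/s")
print(f"HBM-bound ceiling:  {hbm_bound:.0f} tokens/s")
print(f"Ratio: {sram_bound / hbm_bound:.1f}x")
```

Whatever model size is plugged in, the ratio between the two ceilings stays at the tenfold bandwidth gap, which is the point of the architectural argument.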
The jump from LPU v1 to Groq 3 LPU is significant on paper: 500 MB of SRAM per chip versus 230 MB previously, about 2.2 times as much. That is not a minor revision. The extra capacity lets each chip hold larger model fragments on-die and reduces the number of memory round-trips per inference pass.
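The capacity figures in this article can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text (230 MB, 500 MB, 256 chips per LPX rack, 640 TB/s). The per-chip bandwidth share is a simple even split, an assumption for illustration rather than a published per-chip specification.

```python
# Sanity-check the SRAM capacity figures using decimal units (1 GB = 10**9 bytes).

MB = 10**6
GB = 10**9

LPU_V1_SRAM = 230 * MB       # per-chip SRAM, LPU v1
GROQ3_SRAM = 500 * MB        # per-chip SRAM, Groq 3 LPU
CHIPS_PER_LPX_RACK = 256
RACK_SCALE_UP_TBPS = 640     # quoted rack-level scale-up bandwidth, TB/s

generation_ratio = GROQ3_SRAM / LPU_V1_SRAM
rack_sram_gb = CHIPS_PER_LPX_RACK * GROQ3_SRAM / GB

# Assumption: bandwidth divided evenly across chips, for illustration only.
per_chip_share_tbps = RACK_SCALE_UP_TBPS / CHIPS_PER_LPX_RACK

print(f"Groq 3 vs v1 SRAM: {generation_ratio:.2f}x")
print(f"SRAM per LPX rack: {rack_sram_gb:.0f} GB")
print(f"Scale-up bandwidth per chip (even split): {per_chip_share_tbps} TB/s")
```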
How does Groq 3 LPU inference compare to NVIDIA’s own GPU performance?
Groq LPU inference benchmarks against GPU alternatives are stark. For Llama 2 70B, the Groq LPU delivers around 300 tokens per second, while NVIDIA’s H100 GPU manages 30 to 40 tokens per second for the same workload. That’s a 7.5x to 10x speed advantage for inference tasks. It is in the same range as NVIDIA’s own claim that LPUs are approximately 10x more energy-efficient per token than GPUs at an architectural level, although speed and energy per token are separate measurements.
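Because the H100 figure is quoted as a range, the implied speedup is a range as well. A minimal sketch of that arithmetic, using only the tokens-per-second numbers above (the function name is hypothetical):

```python
# Speedup implied by a point estimate for the fast system and a range for the slow one.

def speedup_range(fast_tps: float, slow_tps_low: float, slow_tps_high: float) -> tuple[float, float]:
    """Return (min_speedup, max_speedup) of the fast system over the slow one."""
    return fast_tps / slow_tps_high, fast_tps / slow_tps_low


low, high = speedup_range(fast_tps=300, slow_tps_low=30, slow_tps_high=40)
print(f"Groq LPU vs H100 on Llama 2 70B: {low}x to {high}x")
```

The "roughly 10x" headline corresponds to the low end of the H100 range. Against the high end, the advantage is 7.5x.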
The H200, a Hopper-generation data center GPU from NVIDIA, carries 141 GB of HBM3e memory. A single Groq 3 LPU chip holds just 500 MB of SRAM, roughly 280 times less total capacity. But total capacity isn’t the relevant metric for inference throughput. Bandwidth and latency are. The LPU wins on both, which is precisely why NVIDIA spent $20 billion acquiring Groq rather than trying to out-engineer the architecture internally.
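The capacity gap explains why LPUs are deployed in large groups: a model’s weights have to be spread across many chips. The sketch below estimates how many 500 MB chips are needed to hold weights alone. The 70-billion-parameter size and the bytes-per-parameter values are illustrative assumptions, the function name is hypothetical, and the estimate ignores activations, the KV cache, and any replication.

```python
# How many 500 MB chips are needed just to hold a model's weights on-die?
# Ignores activations, KV cache, and replication (weights only).

MB = 10**6
GB = 10**9

GROQ3_SRAM_BYTES = 500 * MB
H200_HBM_BYTES = 141 * GB


def chips_needed(params: int, bytes_per_param: int, sram_bytes_per_chip: int) -> int:
    """Ceiling division of total weight bytes by per-chip SRAM."""
    total = params * bytes_per_param
    return -(-total // sram_bytes_per_chip)


# Assumption: a 70e9-parameter model at 2 bytes (16-bit) and 1 byte (8-bit).
params = 70 * 10**9
print("16-bit weights:", chips_needed(params, 2, GROQ3_SRAM_BYTES), "chips")
print("8-bit weights: ", chips_needed(params, 1, GROQ3_SRAM_BYTES), "chips")
print("H200 capacity / LPU capacity:", H200_HBM_BYTES // GROQ3_SRAM_BYTES)
```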
Amazon, Google, and Microsoft are all developing custom inference silicon in-house. NVIDIA’s response isn’t to build a faster H-series GPU for inference — it’s to absorb the company that already solved the inference problem architecturally.
What does the Vera Rubin platform with LPX racks actually deliver?
The Vera Rubin platform now comprises NVIDIA’s Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-6 Ethernet switch, and the newly integrated Groq 3 LPU. The LPX rack — housing 256 LPU processors — is designed to operate alongside the Vera Rubin NVL72 configuration, with the GPUs handling training and prefill while the LPUs take over the decode phase.
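The prefill/decode split described above can be illustrated with a toy scheduler. This is a conceptual sketch, not NVIDIA’s or Groq’s software: the class names, the "gpu" and "lpu" pool labels, and the one-token-per-step model are all hypothetical, and the handoff of cached attention state between pools is omitted.

```python
# Toy model of disaggregated serving: one pool handles prefill (the whole
# prompt in a single pass), another handles decode (one token per step).
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0


@dataclass
class DisaggregatedScheduler:
    prefill_queue: list = field(default_factory=list)  # waiting for the "gpu" pool
    decode_queue: list = field(default_factory=list)   # in flight on the "lpu" pool
    log: list = field(default_factory=list)

    def submit(self, req: Request) -> None:
        self.prefill_queue.append(req)

    def step(self) -> None:
        # Decode pool: every in-flight request emits exactly one token.
        still_running = []
        for req in self.decode_queue:
            req.generated += 1
            self.log.append(f"lpu decode {req.request_id} token {req.generated}")
            if req.generated < req.max_new_tokens:
                still_running.append(req)
        self.decode_queue = still_running

        # Prefill pool: process one prompt, then hand it to the decode pool.
        if self.prefill_queue:
            req = self.prefill_queue.pop(0)
            self.log.append(f"gpu prefill {req.request_id} ({req.prompt_tokens} tokens)")
            self.decode_queue.append(req)

    def idle(self) -> bool:
        return not self.prefill_queue and not self.decode_queue


sched = DisaggregatedScheduler()
sched.submit(Request("a", prompt_tokens=1000, max_new_tokens=2))
sched.submit(Request("b", prompt_tokens=50, max_new_tokens=1))
while not sched.idle():
    sched.step()
print("\n".join(sched.log))
```

The design point is that prefill is one large, compute-heavy pass per request while decode is many small, bandwidth-bound steps, so the two phases can be queued and scaled independently.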
NVIDIA claims that running LPX racks alongside Vera Rubin delivers up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models. Both figures are promotional claims without independent methodology behind them — treat them as directional rather than absolute. But the architectural logic is sound: offloading the decode-heavy inference workload to purpose-built LPUs frees GPU compute for prefill and training, which changes the economics of running large-context AI systems.
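For readers who want to test the claim against their own measurements, throughput per megawatt is simple to compute. In the sketch below, the function names are hypothetical and the baseline throughput and power draw are placeholders, not measured or published figures. Only the 35x multiplier comes from the claim above.

```python
# Throughput per megawatt, and what a claimed efficiency gain would require.

def tokens_per_second_per_mw(tokens_per_second: float, power_watts: float) -> float:
    """Normalize throughput by power draw expressed in megawatts."""
    return tokens_per_second / (power_watts / 1e6)


def efficiency_gain(baseline_per_mw: float, candidate_per_mw: float) -> float:
    """Ratio of candidate efficiency to baseline efficiency."""
    return candidate_per_mw / baseline_per_mw


CLAIMED_GAIN = 35.0  # "up to 35x", per NVIDIA's claim

# Placeholder baseline (hypothetical): 1,000,000 tokens/s drawing 2 MW.
baseline = tokens_per_second_per_mw(1_000_000, 2_000_000)

# Efficiency a combined system would need to reach to match the claim.
required = baseline * CLAIMED_GAIN

print(f"Baseline: {baseline:,.0f} tokens/s per MW")
print(f"Needed for {CLAIMED_GAIN:.0f}x: {required:,.0f} tokens/s per MW")
```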
The platform targets trillion-parameter models with million-token contexts. That’s not a mainstream use case today, but it is likely to become one. Enterprises building agentic AI systems, where models need to reason across enormous context windows, will hit GPU memory walls long before they hit compute limits. The LPX rack’s 128 GB of on-chip SRAM and 640 TB/s of scale-up bandwidth is a direct architectural answer to that problem.
Is the NVIDIA Groq acquisition the right move for the inference market?
NVIDIA’s $20 billion acquisition of Groq is the most aggressive strategic move the company has made in the inference space. Training dominance with H100 and H200 GPUs does not automatically transfer to inference, because the two workloads have fundamentally different bottlenecks. NVIDIA recognised this, and rather than iterating on GPU architecture to close the gap, it bought the gap-closer outright.
The inference market is still contested. Cloud providers building in-house silicon aren’t going away, and they have the distribution advantage of owning the infrastructure their models run on. But integrating Groq’s LPU architecture directly into the Vera Rubin platform gives NVIDIA something its competitors can’t easily replicate: a unified training-to-inference stack with purpose-built silicon at both ends.
How fast is the Groq LPU compared to an H100 GPU for inference?
For Llama 2 70B, the Groq LPU delivers approximately 300 tokens per second versus 30 to 40 tokens per second on the NVIDIA H100. That’s a 7.5x to 10x throughput advantage for inference specifically. The H100 remains the dominant chip for AI training workloads, where the LPU’s architecture offers no comparable advantage.
What is the LPX rack configuration in the Vera Rubin platform?
An LPX rack contains 256 Groq 3 LPU processors, delivering 128 GB of total on-chip SRAM and 640 TB/s of scale-up memory bandwidth. These racks are designed to work alongside Vera Rubin NVL72 systems, with LPUs handling the decode phase of inference while Rubin GPUs manage training and prefill tasks.
Why did NVIDIA acquire Groq instead of building its own inference chip?
NVIDIA paid $20 billion for Groq because the LPU’s on-chip SRAM architecture solves the memory-bandwidth bottleneck in AI inference in a way that GPU HBM architectures structurally cannot match. Building a competing LPU from scratch would have taken years. With Amazon, Google, and Microsoft all developing custom inference silicon, NVIDIA couldn’t afford the timeline.
The Groq 3 LPU and LPX racks represent a genuine architectural shift in how NVIDIA thinks about AI infrastructure — not just faster GPUs, but a hybrid platform where the right chip handles the right workload. Whether the 35x throughput-per-megawatt claim holds up under independent scrutiny matters less right now than the strategic signal it sends: NVIDIA is done ceding the inference conversation to anyone.
Edited by the All Things Geek team.
Source: Tom's Hardware


