Running a trillion-parameter LLM locally has long seemed impractical for anyone outside a data center. An enthusiast just proved otherwise by deploying Kimi K2.5, a 1-trillion-parameter model, on a single-GPU workstation equipped with 768GB of Intel Optane DIMM memory sticks, achieving roughly 4 tokens per second during inference.
Key Takeaways
- A Redditor configured a workstation using 768GB of Intel Optane persistent memory DIMMs to run Kimi K2.5 locally.
- The setup delivers approximately 4 tokens per second inference speed with a single GPU.
- Intel Optane PMem, originally designed for enterprise servers, is being repurposed for AI workloads as older stock becomes available secondhand.
- Trillion-parameter models remain impractical for real-time use but are now technically accessible on consumer-adapted hardware.
- This build exemplifies how abandoned enterprise memory can unlock unusual AI capabilities outside mainstream markets.
What This Build Actually Achieves
The core achievement is functional, not fast. Running a trillion-parameter model has historically required cloud access or specialized hardware. This setup shows that with enough persistent memory, a single workstation can host such a model and run inference on it locally, albeit slowly. At the reported 4 tokens per second, a single sentence takes several seconds to appear: usable for experimentation, but not for interactive chat or real-time applications.
The builder leveraged Intel Optane Persistent Memory DIMMs, which were originally engineered for enterprise data centers where they sat between DRAM and NVMe storage in the memory hierarchy. These DIMMs offer substantially higher capacity than conventional DDR4 or DDR5 modules, making them attractive for memory-hungry AI workloads. By acquiring secondhand Optane stock, the enthusiast sidestepped the premium pricing that kept this hardware confined to server rooms.
Why Intel Optane Memory Matters for Local AI
Conventional DRAM has a hard ceiling. A workstation with 192GB or 256GB of DDR4 cannot fit a trillion-parameter model in memory, forcing the system to swap to disk—a process so slow it renders inference unusable. Intel Optane PMem operates at much higher speeds than NVMe, sitting closer to DRAM latency while offering vastly more capacity. This architectural middle ground is precisely what trillion-parameter inference demands.
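The capacity argument can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the source does not say which weight precision the builder used, so the 16-, 8-, and 4-bit figures are assumptions, and the estimate counts weights only (no KV cache, activations, or runtime overhead).

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_memory(num_params: float, bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weight_footprint_gb(num_params, bits_per_weight) <= capacity_gb


if __name__ == "__main__":
    PARAMS = 1e12  # one trillion parameters, as reported for the model
    # Assumed precisions; the builder's actual quantization is not stated in the source.
    for bits in (16, 8, 4):
        size = weight_footprint_gb(PARAMS, bits)
        for capacity in (256, 768):  # GB: a large DDR4 box vs. the Optane build
            verdict = "fits" if fits_in_memory(PARAMS, bits, capacity) else "does not fit"
            print(f"{bits}-bit weights: {size:.0f} GB, {verdict} in {capacity} GB")
```

Under these assumptions, the weights alone come to about 2,000 GB at 16 bits, 1,000 GB at 8 bits, and 500 GB at 4 bits. None of those fit in 256GB, and only the 4-bit case fits in 768GB.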
The trade-off is clear: speed for scale. A cloud-hosted version of Kimi K2.5 would likely deliver 20 to 100 tokens per second depending on the provider’s infrastructure. At roughly 4 tokens per second, this DIY build delivers between one-fifth and one-twenty-fifth of that throughput. Yet the ability to run the model entirely locally, without cloud costs or data transmission, appeals to privacy-conscious users and AI experimenters who prioritize control over speed.
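To put the gap in numbers, the short sketch below divides the estimated cloud range by the reported local speed. The 20 and 100 tokens-per-second endpoints are the article's own estimates, not measurements, and the function name is illustrative.

```python
def slowdown_factor(local_tps: float, cloud_tps: float) -> float:
    """How many times slower the local build is than a cloud endpoint."""
    if local_tps <= 0:
        raise ValueError("local_tps must be positive")
    return cloud_tps / local_tps


if __name__ == "__main__":
    LOCAL_TPS = 4  # reported speed of the Optane build
    # Estimated cloud range from the article, not measured values.
    for cloud_tps in (20, 100):
        factor = slowdown_factor(LOCAL_TPS, cloud_tps)
        print(f"vs. {cloud_tps} tok/s cloud: about {factor:.0f}x slower")
```

On these figures the local build is between 5 and 25 times slower than a cloud deployment, depending on which end of the estimated range applies.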
The Broader Implication for Repurposed Enterprise Hardware
This build illustrates a growing trend in AI hobbyist circles: mining value from enterprise hardware that has been displaced by newer generations. Intel Optane PMem was a strategic bet that did not pan out in mainstream server markets. Cloud providers and hyperscalers largely moved past it, leaving stockpiles of secondhand DIMMs available at a fraction of original cost.
The same pattern has played out with other enterprise memory and accelerators. GPUs, networking cards, and specialized processors designed for data centers often find second lives in AI research labs, cryptocurrency mining operations, and now local LLM inference rigs. The economics work because the original buyer (a corporation) absorbs the depreciation; the hobbyist or small operator buys at a steep discount and extracts utility that would otherwise be wasted.
However, this remains a niche play. Trillion-parameter models are not practical for most users even at 4 tokens per second. The build is intellectually interesting and technically impressive, but it does not represent a path to mainstream local inference. Smaller models—7 billion to 70 billion parameters—remain far more practical for local deployment on standard consumer hardware.
Can This Approach Scale?
The answer is no, not broadly. Optane PMem is a finite resource. Intel discontinued the product line, and secondhand supplies will eventually dry up. Newer persistent memory technologies (such as successors to the 3D XPoint media behind Optane, if they materialize) might offer similar advantages, but they are not yet available at scale. Additionally, the workstation build requires a motherboard and BIOS that support Optane PMem, and that compatibility is limited and not trivial to achieve.
For the vast majority of users seeking local LLM inference, smaller open-source models running on conventional GPUs and RAM remain the practical choice. Llama 2 70B, Mixtral 8x7B, and other mid-sized models deliver useful performance without exotic hardware. The trillion-parameter build is a proof-of-concept, not a blueprint for mass adoption.
Is local trillion-parameter LLM inference useful?
At 4 tokens per second, the answer depends on your tolerance for latency. For batch processing, research, or non-interactive experimentation, it is workable. For chatting, coding assistance, or real-time applications, the speed is too slow. Most users would find cloud-hosted inference or smaller local models more practical despite the trade-offs in privacy and cost.
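A rough way to judge that tolerance is to convert response length into waiting time. The sketch below does so for a few hypothetical response lengths. The token counts are illustrative assumptions, and the estimate ignores the time spent processing the prompt before generation starts.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate, excluding prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    RATE = 4.0  # tokens per second, as reported for the build
    # Hypothetical response lengths, chosen only for illustration.
    workloads = {"short reply": 50, "detailed answer": 500, "long report": 2000}
    for name, tokens in workloads.items():
        secs = generation_seconds(tokens, RATE)
        print(f"{name} ({tokens} tokens): {secs:.1f} s, about {secs / 60:.1f} min")
```

At this rate a 50-token reply takes 12.5 seconds and a 500-token answer just over two minutes. That suits unattended batch jobs better than a live conversation.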
Can I build this setup myself?
Technically yes, but with caveats. You need a workstation motherboard that supports Intel Optane PMem (typically Xeon-based platforms), a compatible CPU, 768GB of Optane DIMMs from the secondhand market, a GPU, and the software stack to run Kimi K2.5 locally. The hardware sourcing is the main challenge—Optane PMem is not sold new at consumer retailers, and finding compatible used stock requires patience and knowledge of enterprise hardware markets. The software side is more straightforward if you are comfortable with open-source LLM frameworks.
What GPU did the builder use?
The source report mentions a single GPU but does not name the model. The choice of GPU matters for inference speed, but without that detail confirmed, any specification would be speculation.
This build is a reminder that AI hardware innovation is not confined to official product launches and benchmark suites. Enthusiasts continue to find creative ways to extract capability from abandoned enterprise infrastructure, pushing the boundaries of what is technically possible at home. For most users, though, the practical path forward remains smaller models on accessible hardware—not trillion-parameter behemoths, no matter how cleverly configured.
Edited by the All Things Geek team.
Source: Tom's Hardware


