The NVIDIA DGX Spark is the productized version of Project DIGITS: a 150 x 150 x 50.5 mm box housing the GB10 Grace Blackwell Superchip, 128 GB of unified LPDDR5x, and the full CUDA AI stack out of the box. Tom's Hardware called it 'a well-rounded toolkit for local AI'; ServeTheHome called it 'a must-have for AI developers'; LMSYS published the most thorough independent benchmarks. The 128 GB unified-memory ceiling is the headline feature: it loads models that would otherwise need a $30K+ multi-GPU rig. The catch is bandwidth-limited decode. LMSYS measured Llama 3.1 70B FP8 at 2.7 tokens/sec at batch size 1, while GPT-OSS 120B (MoE, ~5.1B active parameters per token) hits ~14.5 tokens/sec per ServeTheHome. It is best understood as a CUDA-native development box for buyers who need to iterate on big-model code without renting cloud GPUs.

Full review
What It Is and Who It's For
The DGX Spark started life as Project DIGITS, NVIDIA's January 2025 announcement of a $3,000 personal AI supercomputer. It began shipping under the DGX Spark name on October 15, 2025, at $3,999, before a memory-supply price increase pushed the MSRP to $4,699 in February 2026. The hardware is straightforward: a 150 x 150 x 50.5 mm chassis weighing 1.2 kg, built around the GB10 Grace Blackwell Superchip, which pairs a 20-core Arm CPU (10x Cortex-X925 + 10x Cortex-A725) with a Blackwell GPU featuring 5th-gen Tensor Cores and up to 1 PFLOP of sparse FP4 compute. Memory is 128 GB of unified LPDDR5x at 273 GB/s, and storage is a 4 TB user-replaceable self-encrypting NVMe M.2 drive. The whole thing draws 240 W max from a wall-wart-style external PSU. NVIDIA's positioning is unambiguous: this is a development box for engineers who need to iterate on large-model code without renting cloud GPUs, not a general-purpose workstation.
Local LLM Performance
The independent benchmark of record for the DGX Spark is LMSYS's October 2025 deep-dive, which reported FP8 results from SGLang across multiple model sizes. Llama 3.1 8B FP8 hits 7,991 tokens/sec prefill and 20.5 tokens/sec decode at batch 1, scaling to 368 tokens/sec decode at batch 32. Llama 3.1 70B FP8 measures 803 tokens/sec prefill and 2.7 tokens/sec decode at batch 1. That decode figure is bandwidth-limited: each generated token requires reading roughly 70 GB of weights, and memory tops out at 273 GB/s. ServeTheHome reported approximately 14.5 tokens/sec on GPT-OSS 120B, which sounds counterintuitive until you remember GPT-OSS is mixture-of-experts with only ~5.1B active parameters per token, so memory bandwidth pressure is much lower than for a dense 70B. The practical takeaway: the DGX Spark is fastest on small-to-mid models (≤30B) and on MoE models, and slow on dense 70B+. For sustained high-throughput dense-70B inference, an RTX 5090 (for quantizations that fit in its 32 GB of VRAM) or a 4x A6000 box is several times faster. For development work such as fitting models in memory, validating training scripts, and doing single-batch experiments, the DGX Spark is the cheapest and most polished tool available.
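The bandwidth ceiling on decode can be approximated with simple arithmetic. The sketch below is a back-of-envelope estimate, not LMSYS's methodology: it assumes batch-1 decode reads every active weight once per token and ignores KV-cache traffic and kernel overhead. The function name and the nominal parameter counts (8B, 70B) are illustrative assumptions; the bandwidth and measured figures are the ones quoted above.

```python
# Back-of-envelope decode ceiling: tokens/sec <= bandwidth / bytes read per token.
# Assumption: at batch 1, each token requires one full pass over the active weights.

SPARK_BANDWIDTH_GB_S = 273.0  # LPDDR5x bandwidth quoted in this review


def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads dominate.

    active_params_b is in billions of parameters; FP8 is 1 byte per parameter.
    """
    weight_gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / weight_gb_per_token


if __name__ == "__main__":
    # (label, nominal parameters in billions, measured batch-1 decode tok/s)
    cases = [
        ("Llama 3.1 8B FP8", 8.0, 20.5),
        ("Llama 3.1 70B FP8", 70.0, 2.7),
    ]
    for label, params_b, measured in cases:
        ceiling = decode_ceiling_tok_s(SPARK_BANDWIDTH_GB_S, params_b, 1.0)
        share = measured / ceiling
        print(f"{label}: ceiling {ceiling:.1f} tok/s, "
              f"measured {measured} tok/s ({share:.0%} of ceiling)")
```

Under these assumptions the 70B FP8 ceiling is about 3.9 tokens/sec, so the measured 2.7 tokens/sec sits reasonably close to what the memory system allows. The same function explains why MoE models decode faster: only the active parameters count toward bytes read per token.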
Software and Toolchain
Where the DGX Spark genuinely earns its premium over a Strix Halo mini PC is the software stack. NVIDIA ships it with the full DGX OS image preinstalled — CUDA, cuDNN, TensorRT-LLM, NeMo, NIM containers, the latest PyTorch and JAX builds, and the NVIDIA AI Enterprise license. For a developer whose code is written against CUDA, this is a same-day install-and-go experience. The Strix Halo and Mac Studio alternatives can run local LLMs well via Ollama, llama.cpp, or MLX, but porting CUDA-native research code to those stacks is a real engineering tax. ServeTheHome's review framed this as the practical reason the DGX Spark exists: every AI engineer's existing PyTorch+CUDA workflow runs unmodified, including training small models, fine-tuning with LoRA/QLoRA, and exporting TensorRT-optimized inference engines for production deployment on bigger NVIDIA hardware.
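A quick way to confirm that a PyTorch+CUDA workflow sees the GPU on any CUDA machine is a short device check. This is a minimal sketch, not an NVIDIA-provided script; the function name and the returned dictionary keys are hypothetical, and the memory figures reported on a unified-memory system may not match the advertised 128 GB exactly.

```python
def describe_cuda_device() -> dict:
    """Report whether PyTorch can see a CUDA device, without raising if it cannot."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False, "cuda_available": False}

    info = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # mem_get_info returns (free_bytes, total_bytes) for the current device.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        info["device_name"] = torch.cuda.get_device_name(0)
        info["total_memory_gb"] = round(total_bytes / 1e9, 1)
        info["free_memory_gb"] = round(free_bytes / 1e9, 1)
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_device().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back true, existing CUDA-targeted training and fine-tuning scripts should at least find a device to run on.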
Clustering and Scale-Up
The DGX Spark ships with two QSFP56 ports backed by a ConnectX-7 NIC, supporting up to 200 GbE RDMA. NVIDIA's documentation and LMSYS's testing show that two Sparks linked over ConnectX-7 can collectively address 256 GB of unified memory and run 405B-parameter models at low-bit quantization, a ceiling that is otherwise the exclusive domain of multi-GPU rigs costing $30K+. At roughly $9,400 for a two-unit cluster, this is the cheapest path to running Llama-3-405B-class quants on a desk. The trade-off is the same bandwidth wall: dense 405B inference at low batch sizes runs at single-digit tokens/sec. Buyers running fine-tunes or small-batch evaluations of frontier models will find this acceptable; buyers needing production-grade serving will not.
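The memory-fit claim can be checked with a rough footprint calculation. The sketch below counts weights only and assumes a flat 10% of each unit's memory is reserved for the OS, KV cache, and activations; that reserve figure and the function names are illustrative assumptions, and real quantization formats add some overhead for scales and metadata.

```python
import math

UNIT_MEMORY_GB = 128.0  # unified memory per DGX Spark, per this review


def weight_footprint_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8.0


def units_needed(params_b: float, bits_per_weight: float,
                 reserve_fraction: float = 0.10) -> int:
    """Number of units whose combined usable memory holds the weights.

    reserve_fraction is a hypothetical allowance for everything that is not weights.
    """
    usable_per_unit = UNIT_MEMORY_GB * (1.0 - reserve_fraction)
    return math.ceil(weight_footprint_gb(params_b, bits_per_weight) / usable_per_unit)


if __name__ == "__main__":
    for label, params_b, bits in [
        ("70B at 8-bit", 70.0, 8),
        ("120B at 4-bit", 120.0, 4),
        ("405B at 4-bit", 405.0, 4),
        ("405B at 8-bit", 405.0, 8),
    ]:
        size = weight_footprint_gb(params_b, bits)
        print(f"{label}: ~{size:.1f} GB of weights, "
              f"{units_needed(params_b, bits)} unit(s)")
```

With these assumptions, 405B at 4-bit needs about 202.5 GB of weights, which fits in the 256 GB of a two-unit cluster, while the same model at 8-bit does not.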
Where It Falls Short
The DGX Spark is unapologetically specialist hardware. It runs only NVIDIA's DGX OS: no Windows, no gaming, no creative-suite workflows. It has a single HDMI output and no front-panel LEDs or display indicators; the chassis is functional rather than decorative. The 273 GB/s memory bandwidth is the hard ceiling on dense large-model decode performance, and there's no path to upgrade it. The price increase from $3,999 to $4,699 in February 2026 (driven by global LPDDR5x memory supply pressure) ate into the original value proposition versus a Strix Halo mini PC, which now costs well under half as much for the same memory ceiling. Buyers who don't specifically need CUDA-native software, a polished out-of-box experience, or two-unit clustering will get more raw inference performance per dollar from a Mac Studio M3 Ultra or a single-GPU PC build.
Strengths
- 128 GB unified LPDDR5x memory: fits 70B FP8 or 120B Q4 on one unit, and 405B-class quants with two clustered units
- Full CUDA + NVIDIA AI stack preinstalled; the most polished local-AI dev box on the market
- Compact 150 x 150 x 50.5 mm chassis, 240 W max: fits any desk, runs cool and quiet
- Dual ConnectX-7 200 GbE QSFP56 ports for low-latency two-unit clustering
Watch-outs
- 273 GB/s LPDDR5x bandwidth caps decode tok/s on dense large models; 70B FP8 measures ~2.7 tok/s on a single unit
- Linux-only, no Windows or gaming use; specialist hardware for AI developers
- Price raised from $3,999 to $4,699 in February 2026 due to memory supply
How it compares
The DGX Spark is the cheapest path to 128 GB of CUDA-addressable unified memory anywhere on the market. Versus the GMKtec EVO-X2 ($1,699) or Beelink GTR9 Pro ($2,000), it's roughly 2.5x the price but offers the full NVIDIA software stack, which the Strix Halo boxes can only approximate via ROCm or Vulkan. Versus the Puget Genesis II ($10K+), it's a single-purpose dev box: no multi-display creative workflow, no gaming, no general workstation duty. Pair two Sparks via the ConnectX-7 networking and you get 405B-class model coverage at roughly $9,400, the cheapest path to that ceiling.
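The price comparisons above reduce to a few lines of arithmetic. This sketch uses only the prices quoted in this review; the variable and function names are illustrative.

```python
SPARK_MSRP_USD = 4699  # MSRP after the February 2026 increase

# Strix Halo mini PC prices as quoted in this review.
STRIX_HALO_PRICES_USD = {
    "GMKtec EVO-X2": 1699,
    "Beelink GTR9 Pro": 2000,
}


def price_ratio(price_usd: float, baseline_usd: float) -> float:
    """How many times more expensive price_usd is than baseline_usd."""
    return price_usd / baseline_usd


def cluster_price_usd(unit_price_usd: int, units: int) -> int:
    """Total hardware price for a cluster, ignoring cables and tax."""
    return unit_price_usd * units


if __name__ == "__main__":
    for name, price in STRIX_HALO_PRICES_USD.items():
        ratio = price_ratio(SPARK_MSRP_USD, price)
        print(f"DGX Spark vs {name}: {ratio:.2f}x the price")
    print(f"Two-unit cluster: ${cluster_price_usd(SPARK_MSRP_USD, 2):,}")
```

The two ratios bracket the "roughly 2.5x" figure, and the two-unit total comes to $9,398 before cabling.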
Who this is for
At a glance: Best for local-LLM developers who want a CUDA-native 128 GB dev box.
Rating sources
“A well-rounded toolkit for local AI — pricey if you don't use its features to the fullest.”
“So freaking cool. A must-have for AI developers.”
“A new standard for local AI inference.”
Our 4.6 score is the average of these published ratings. Ratings marked * were derived from the reviewer’s written analysis or video transcript: the publisher didn’t print an explicit numeric score, so we inferred one from their own words.



