DGX Spark vs AMD Strix Halo

I know that many people are cross-shopping DGX Spark and AMD Strix Halo systems (Framework Desktop, etc.) for a low-power solution that can do some AI/LLM stuff.

There are a lot of reviews on the Web and YouTube, but most people making those don’t work with AI and specifically LLMs for a living, so we see them doing silly things like testing with Ollama (especially on an AMD device!).

Since I’ve got both of these, I figured that it would be useful to share my initial impressions of both. I’ve had my Strix Halo system (GMKtec EVO-X2 128GB) for about a week and the DGX Spark for just a couple of days, so it’s definitely a work in progress.

I made a post on Reddit, but figured that cross-posting here may help some folks.

Hardware

DGX Spark is probably the most minimalist mini-PC I’ve ever used.

It has absolutely no LEDs, not even in the LAN port, and the on/off switch is a button, so unless you ping it over the network or hook up a display, good luck guessing if this thing is on.
All ports are in the back: a single HDMI port (no DisplayPort), USB-C (power only), 3x USB-C 3.2 Gen 2 ports, a 10G Ethernet port and 2x QSFP ports.

The air intake is in the front and the exhaust is in the back. It is quiet for the most part, but the fan is quite audible when it’s on (though quieter than my GMKtec).

It has a single 4TB PCIe 5.0 x4 M.2 2242 SSD - SAMSUNG MZALC4T0HBL1-00B07. I couldn’t find it anywhere for sale in the 2242 form factor, only the 2280 version, but DGX Spark only takes 2242 drives. I wish they went with standard 2280 - weird decision, given that it’s a mini-PC, not a laptop or tablet. Who cares if the motherboard is an inch longer!

The performance seems good: hdparm gives me 4240.64 MB/sec vs 3118.53 MB/sec on my GMKtec.

It is user replaceable, but there is only one slot, accessible from the bottom of the device. You need to take the magnetic plate off and there are some access screws underneath.

The unit is made of metal, and gets quite hot during high loads, but not unbearably hot like some reviews mentioned. It cools down quickly, though (metal!).

The CPU is a 20-core ARM design with 10 performance and 10 efficiency cores. I didn’t benchmark it, but other reviews show CPU performance similar to Strix Halo.

Initial Setup

DGX Spark comes with DGX OS pre-installed (more on this later). You can set it up interactively using keyboard/mouse/display or in headless mode via WiFi hotspot that it creates.

I tried to set it up by connecting my trusted Logitech keyboard/trackpad combo that I use to set up pretty much all my server boxes, but once it booted up, it displayed a “Connect the keyboard” message and didn’t let me proceed any further. The trackpad portion worked, and the volume keys on the keyboard also worked! I rebooted, and was able to enter BIOS (by pressing Esc) just fine, and the keyboard was fully functional there!

BTW, it has AMI BIOS, but doesn’t expose anything interesting other than networking and boot options.

Booting into DGX OS resulted in the same problem. After some googling, I figured out that it shipped with a borked kernel that broke Logitech unified setups, so I decided to proceed in headless mode.

I connected to the Wi-Fi hotspot from my Mac (the hotspot SSID/password are printed on a sticker on top of the quick start guide) and was able to continue the setup there, which was pretty smooth, other than the Mac spamming me with a “connect to internet” popup every minute or so. It then proceeded to update firmware and OS packages, which took about 30 minutes, but eventually finished, and after that my Logitech keyboard worked just fine.

Linux Experience

DGX Spark runs DGX OS 7.2.3 which is based on Ubuntu 24.04.3 LTS, but uses NVidia’s custom kernel, and an older one than mainline Ubuntu LTS uses.
So instead of 6.14.x you get 6.11.0-1016-nvidia.

It comes with CUDA 13.0 development kit and NVidia drivers (580.95.05) pre-installed.
It also has NVidia’s container toolkit that includes docker, and GPU passthrough works well.

Other than that, it’s a standard Ubuntu Desktop installation, with GNOME and everything.

SSHd is enabled by default, so after headless install you can connect to it immediately without any extra configuration.

RDP remote desktop doesn’t work currently - it connects, but display output is broken.

I tried to boot from Fedora 43 Beta Live USB, and it worked, sort of. First, you need to disable Secure Boot in BIOS. Then, it boots only in “basic graphics mode”, because built-in nvidia drivers don’t recognize the chipset. It also throws other errors complaining about chipset, processor cores, etc.

I think I’ll try to install it to an external SSD and see if NVidia standard drivers will recognize the chip. There is hope:

 ==============
 PLATFORM INFO:
 ==============
 IOMMU: Pass-through or enabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  13000
 Platform: NVIDIA_DGX_Spark, Arch: aarch64(Linux 6.11.0-1016-nvidia)
 Platform verification succeeded

As for Strix Halo, it’s an x86 PC, so you can run any distro you want. I chose Fedora 43 Beta, currently running with kernel 6.17.3-300.fc43.x86_64. Smooth sailing, up-to-date packages.

Llama.cpp experience

DGX Spark

You need to build it from source as there is no CUDA ARM build, but compiling llama.cpp was very straightforward - CUDA toolkit is already installed, just need to install development tools and it compiles just like on any other system with NVidia GPU. Just follow the instructions, no surprises.

However, when I ran the benchmarks, I ran into two issues.

  1. The model loading was VERY slow. It took 1 minute 40 seconds to load gpt-oss-120b. For comparison, it takes 22 seconds to load on Strix Halo (both from cold, memory cache flushed).
  2. I wasn’t getting the same results as ggerganov in this thread. While prompt processing (PP) was pretty impressive for such a small system, token generation (TG) was matching or even slightly worse than my Strix Halo setup with ROCm.

For instance, here are my Strix Halo numbers, compiled with ROCm 7.10.0a20251017, llama.cpp build 03792ad9 (6816), HIP only, no rocWMMA:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
| model | size | params | backend | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 999.59 ± 4.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 47.49 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 824.37 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 44.23 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 703.42 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 42.52 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 514.89 ± 3.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 39.71 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 348.59 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 35.39 ± 0.01 |

The same command on Spark gave me this:

| model | size | params | backend | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1816.00 ± 11.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 44.74 ± 0.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1763.75 ± 6.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 42.69 ± 0.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1695.29 ± 11.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 40.91 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1512.65 ± 6.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 38.61 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1250.55 ± 5.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 34.66 ± 0.02 |

I tried enabling Unified Memory switch (GGML_CUDA_ENABLE_UNIFIED_MEMORY=1) - it improved model loading, but resulted in even worse performance.

I reached out to ggerganov, and he suggested disabling mmap. I thought I tried it, but apparently not.
Well, that fixed it. Model loading improved too - now taking 56 seconds from cold and 23 seconds when it’s still in cache.
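To put the load times in perspective, here is a rough Python sketch that converts them into an average load rate. It assumes the whole 59.02 GiB file (the size llama-bench reports) is read during loading, which is my simplification; the timings are the ones quoted in this post, and the labels are just mine.

```python
# Model size as reported by llama-bench in this post.
MODEL_GIB = 59.02

# Load times in seconds, as quoted in the post.
LOAD_SECONDS = {
    "Spark, mmap, cold": 100,       # 1 minute 40 seconds
    "Spark, no mmap, cold": 56,
    "Spark, no mmap, cached": 23,
    "Strix Halo, cold": 22,
}


def effective_gib_per_sec(size_gib: float, seconds: float) -> float:
    """Average load rate: model size divided by wall-clock load time."""
    return size_gib / seconds


# Assumption: the full file is read once per load.
rates = {name: effective_gib_per_sec(MODEL_GIB, s) for name, s in LOAD_SECONDS.items()}

for name, rate in rates.items():
    print(f"{name:<24} {rate:.2f} GiB/s")
```

Even the no-mmap cold load on Spark averages only about 1 GiB/s, far below what hdparm reports for the SSD, so the drive itself doesn’t look like the bottleneck.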

Updated numbers:

| model | size | params | backend | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 1939.32 ± 4.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 56.33 ± 0.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 1832.04 ± 5.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 52.63 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 1738.07 ± 5.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 48.60 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1525.71 ± 12.34 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 45.01 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1242.35 ± 5.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 39.10 ± 0.09 |

As you can see, TG is much better (56.33 vs 44.74 t/s at zero depth), and PP improved at short contexts while staying about the same at 32K.

As for Strix Halo, mmap/no-mmap doesn’t make any difference there.
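For a quick side-by-side, the Python sketch below divides the Spark no-mmap numbers by my Strix Halo ROCm numbers at each context depth. The t/s values are copied from the tables in this post; the variable and function names are just mine.

```python
# t/s values copied from the llama-bench tables in this post (gpt-oss-120b, MXFP4).
DEPTHS = [0, 4096, 8192, 16384, 32768]

# Strix Halo, ROCm, HIP only, no rocWMMA.
STRIX_ROCM = {
    "pp": [999.59, 824.37, 703.42, 514.89, 348.59],
    "tg": [47.49, 44.23, 42.52, 39.71, 35.39],
}

# DGX Spark, CUDA, mmap disabled.
SPARK_NO_MMAP = {
    "pp": [1939.32, 1832.04, 1738.07, 1525.71, 1242.35],
    "tg": [56.33, 52.63, 48.60, 45.01, 39.10],
}


def ratios(numerators, denominators):
    """Element-wise speedup, rounded to two decimals."""
    return [round(n / d, 2) for n, d in zip(numerators, denominators)]


pp_ratio = ratios(SPARK_NO_MMAP["pp"], STRIX_ROCM["pp"])
tg_ratio = ratios(SPARK_NO_MMAP["tg"], STRIX_ROCM["tg"])

for depth, pp, tg in zip(DEPTHS, pp_ratio, tg_ratio):
    print(f"depth {depth:>5}: PP {pp:.2f}x, TG {tg:.2f}x")
```

The PP gap widens as the context fills up, while the TG gap stays small at every depth.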

Strix Halo

On Strix Halo, llama.cpp experience is… well, a bit turbulent.

You can download a pre-built version for Vulkan, and it works, but the performance is a mixed bag. TG is pretty good, but PP is not great.

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 --mmap 0 -ngl 999 -ub 1024

NOTE: Vulkan likes a micro-batch size (-ub) of 1024 the most, unlike ROCm, which likes 2048 better.

| model | size | params | backend | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 | 526.54 ± 4.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 | 52.64 ± 0.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d4096 | 438.85 ± 0.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d4096 | 48.21 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d8192 | 356.28 ± 4.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d8192 | 45.90 ± 0.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d16384 | 210.17 ± 2.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d16384 | 42.64 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | pp2048 @ d32768 | 138.79 ± 9.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Vulkan | tg32 @ d32768 | 36.18 ± 0.02 |

I tried toolboxes from kyuz0, and some of them were better, but I still felt that I could squeeze more juice out of it. All of them suffered from significant performance degradation when the context was filling up.

Then I tried to compile my own using the latest ROCm build from TheRock (on that date).

I also built rocWMMA as recommended by kyuz0 (more on that later).

Llama.cpp compiled without major issues - I had to configure the paths properly, but other than that, it just worked.
PP increased dramatically compared to Vulkan, but TG decreased.

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 | 1030.71 ± 2.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 | 47.84 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d4096 | 802.36 ± 6.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d4096 | 39.09 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d8192 | 615.27 ± 2.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d8192 | 33.34 ± 0.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d16384 | 409.25 ± 0.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d16384 | 25.86 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 @ d32768 | 228.04 ± 0.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg32 @ d32768 | 18.07 ± 0.03 |

But the biggest issue is significant performance degradation with long context, much more than you’d expect: TG falls from 47.84 t/s at zero depth to 18.07 t/s at 32K.

Then I stumbled upon Lemonade SDK and their pre-built llama.cpp. Ran that one, and got much better results across the board. TG was still below Vulkan, but PP was decent and degradation wasn’t as bad:

| model | size | params | test | t/s |
| --- | ---: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 999.20 ± 3.44 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 47.53 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 826.63 ± 9.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 44.24 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 702.66 ± 2.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 42.56 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 505.85 ± 1.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 39.82 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 343.06 ± 2.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 35.50 ± 0.02 |

So I looked at their compilation options and noticed that they build without rocWMMA. So, I did the same and got similar performance too!

| model | size | params | backend | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 | 1000.93 ± 1.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 | 47.46 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d4096 | 827.34 ± 1.99 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d4096 | 44.20 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d8192 | 701.68 ± 2.36 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d8192 | 42.39 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d16384 | 503.49 ± 0.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d16384 | 39.61 ± 0.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | pp2048 @ d32768 | 344.36 ± 0.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | tg32 @ d32768 | 35.32 ± 0.01 |

So far that’s the best I could get from Strix Halo. It’s very usable for text generation tasks.
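To make the context degradation easier to compare across the three Strix Halo builds, here is a small Python sketch that computes how much of the zero-depth throughput each one retains at 32K. The numbers are copied from the tables above; "retention" is just my own shorthand for (t/s at d32768) / (t/s at d0).

```python
# (t/s at depth 0, t/s at depth 32768), copied from the tables in this post.
BUILDS = {
    "Vulkan (pre-built)": {"pp": (526.54, 138.79), "tg": (52.64, 36.18)},
    "ROCm + rocWMMA":     {"pp": (1030.71, 228.04), "tg": (47.84, 18.07)},
    "ROCm, no rocWMMA":   {"pp": (1000.93, 344.36), "tg": (47.46, 35.32)},
}


def retention(at_zero: float, at_32k: float) -> float:
    """Fraction of zero-depth throughput that is left at 32K context."""
    return at_32k / at_zero


tg_retention = {name: retention(*vals["tg"]) for name, vals in BUILDS.items()}
pp_retention = {name: retention(*vals["pp"]) for name, vals in BUILDS.items()}

for name in BUILDS:
    print(f"{name:<20} PP keeps {pp_retention[name]:.0%}, TG keeps {tg_retention[name]:.0%}")
```

The rocWMMA build keeps only roughly 38% of its TG speed at 32K, versus roughly 74% without rocWMMA, which is why I dropped it even though its zero-depth numbers look fine.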

I also wanted to touch on multimodal performance. That’s where Spark shines. I don’t have any specific benchmarks yet, but image processing is much faster on Spark than on Strix Halo, especially in vLLM.

vLLM Experience

Haven’t had a chance to do extensive testing here, but wanted to share some early thoughts.

DGX Spark

First, I tried to just build vLLM from source as usual. The build was successful, but it failed with the following error: ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'

I decided not to spend too much time on this for now, and just launched vLLM container that NVidia provides through their Docker repository.
It is built for DGX Spark, so supports it out of the box.

However, it ships vLLM 0.10.1, so I wasn’t able to run Qwen3-VL there.

They put the source code inside the container, but not as a git repository. It probably contains some NVidia-specific patches; I’ll need to see whether those could be merged into the main vLLM code.

So I just checked out the vLLM main branch and proceeded to build with the existing PyTorch as usual. This time I was able to run it and launch Qwen3-VL models just fine.
Both dense and MOE work. I tried FP4 and AWQ quants - everything works, no need to disable CUDA graphs.

The performance is decent - I still need to run some benchmarks, but image processing is very fast.

Strix Halo

Unlike llama.cpp that just works, vLLM experience on Strix Halo is much more limited.

My goal was to run Qwen3-VL models that are not supported by llama.cpp yet, so I needed to build 0.11.0 or later. There are some existing containers/toolboxes for earlier versions, but I couldn’t use them.

So, I installed the ROCm PyTorch libraries from TheRock, some patches from the kyuz0 toolboxes to avoid the amdsmi package crash, and ROCm FlashAttention, and then just followed the standard vLLM installation instructions with existing PyTorch.

I was able to run Qwen3-VL dense models with decent (for dense models) speeds, although initialization takes quite some time unless you reduce --max-num-seqs to 1 and set tensor parallelism (tp) to 1.
The image processing is very slow though, much slower than llama.cpp for the same image, but the token generation is about what you’d expect from it.

Again, model loading is faster than on Spark for some reason (I’d expect the other way around, given the faster SSD in Spark and slightly faster memory).

I’m going to rebuild vLLM and re-test/benchmark later.

Some observations:

  • FP8 models don’t work - they hang on WARNING 10-22 12:55:04 [fp8_utils.py:785] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/eugr/vllm/vllm/vllm/model_executor/layers/quantization/utils/configs/N=6144,K=2560,device_name=Radeon_8060S_Graphics,dtype=fp8_w8a8,block_shape=[128,128].json
  • You need to use --enforce-eager, as CUDA graphs crash vLLM. Sometimes it works, but mostly crashes.
  • Even with --enforce-eager, there are some HIP-related crashes here and there occasionally.
  • AWQ models work, both 4-bit and 8-bit, but only dense ones. AWQ MOE quants require Marlin kernel that is not available for ROCm.
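Putting the settings above together, here is a minimal Python sketch that assembles the kind of vllm serve command line I end up with on Strix Halo. The flags are the ones mentioned in this section; the model ID is only a placeholder for illustration, and the helper name is made up.

```python
import shlex


def strix_halo_vllm_cmd(model: str) -> list[str]:
    """Build a vllm serve command with the workarounds described above."""
    return [
        "vllm", "serve", model,
        "--max-num-seqs", "1",           # speeds up initialization
        "--tensor-parallel-size", "1",   # "tp 1"
        "--enforce-eager",               # CUDA graphs mostly crash on ROCm here
    ]


# Placeholder model ID (assumption): substitute the dense Qwen3-VL model you use.
cmd = strix_halo_vllm_cmd("Qwen/Qwen3-VL-8B-Instruct")
print(shlex.join(cmd))
```

Keeping it as a list makes it easy to pass to subprocess.run(cmd) without worrying about shell quoting.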

Conclusion / TL;DR

Summary of my initial impressions:

  • DGX Spark is an interesting beast for sure.
    • Limited extensibility - no USB-4, only one M.2 slot, and it’s 2242.
    • But has 200Gbps network interface.
  • It’s a first generation of such devices, so there are some annoying bugs and incompatibilities.
  • Inference-wise, token generation is nearly identical to Strix Halo in both llama.cpp and vLLM, but prompt processing is 2-5x faster on Spark.
    • Strix Halo performance in prompt processing degrades much faster with context.
    • Image processing takes longer on Strix Halo, especially with vLLM.
    • Model loading into unified RAM is slower on DGX Spark for some reason, both in llama.cpp and vLLM.
  • Even though vLLM lists gfx1151 (Strix Halo) among its supported configurations, it still requires some hacks to compile.
    • And even then, the experience is suboptimal. Initialization time is slow, it crashes, FP8 doesn’t work, AWQ for MOE doesn’t work.
  • If you are an AI developer who uses transformers/PyTorch or you need vLLM - you are better off with DGX Spark (or just a normal GPU build).
  • If you want a power-efficient inference server that can run gpt-oss and similar MOE at decent speeds, and don’t need to process images often, Strix Halo is the way to go.
  • If you want a general purpose machine, Strix Halo wins too.

DGX Spark vs Strix Halo …that’s 🍎🍊. CUDA or bust!

AMD has ROCm. Not a “real” CUDA, but PyTorch works.
Of course, CUDA is better supported, although GB10 and Blackwell in general are not exactly trouble-free yet.

This post ranks pretty highly on Google for “dgx spark vs strix halo”. I was wondering whether the conclusion still holds true after 3 months of software updates in the DGX Spark ecosystem?

I am particularly interested in the local inference workload only use case.

Thanks

Still true, but at this point I’d recommend DGX Spark over Strix Halo unless money is a concern:

  • vLLM support is still bad on Strix Halo, so you are pretty much limited to llama.cpp there.
  • Prompt processing speed is much higher on DGX Spark, especially if using vLLM. Like 5x higher, even more on longer contexts.
  • Model loading speed has improved on Spark, unless you use mmap - that one is still not great, but you can use --no-mmap with llama.cpp and --load-format fastsafetensors with vLLM.
  • 200G networking is a MAJOR feature of Spark. A two-Spark cluster can lead to almost 2x gains in inference speed with dense models and lower, but still noticeable, gains for MoE, and you can unlock larger models as a result. You are not limited to 2 Sparks either; some people here have 8x Spark clusters now.
  • Strix Halo machines increased in price, while OG Spark stays the same and OEM ones can be had for less money.
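For reference, the two loading workarounds from the list above look roughly like this when spelled out as command lines. This is only a Python sketch that builds the argument lists; the model path and model ID are placeholders, and I’m assuming llama-server as the llama.cpp entry point.

```python
import shlex

# Placeholders (assumptions): point these at your own GGUF file / model ID.
GGUF_PATH = "gpt-oss-120b-mxfp4-00001-of-00003.gguf"
HF_MODEL = "your-org/your-model"

# llama.cpp: skip mmap so the weights are read straight into memory.
llama_cmd = ["llama-server", "-m", GGUF_PATH, "--no-mmap"]

# vLLM: use the fastsafetensors loader instead of the default one.
vllm_cmd = ["vllm", "serve", HF_MODEL, "--load-format", "fastsafetensors"]

for cmd in (llama_cmd, vllm_cmd):
    print(shlex.join(cmd))
```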

I still have both, but I use my dual Spark cluster for pretty much anything - Strix Halo machine performs some LLM stuff for my home automation pipelines now.