Nemotron-3-Super NVFP4 via vLLM TP=2 on 2x DGX Spark — 24 tok/s (ABI fix for cu130/cu132 mismatch)

EDIT: this is now resolved by a new prebuilt wheel (.dev176), so the original issue doesn't need any further changes; just clear the wheel cache so the new nightly is pulled.

Hello!

I’ve been trying to get Nemotron-3-Super running on a dual DGX Spark setup and wanted to share what I found, since I saw several other posts and comments about the same issue.

Note: I used Claude to help generate the rest of the message with the information that fixed running nemotron-3-super with TP=2 on my dual DGX spark stack.

**TL;DR**: The `_ZN3c1013MessageLoggerC1E` crash that’s been hitting people building with `spark-vllm-docker` is a cu130/cu132 mismatch in the Dockerfile. Two-line fix, PR submitted, and Nemotron Super NVFP4 is now serving at 24 tok/s via vLLM TP=2.

**The Setup**

- 2x DGX Spark GB10, ConnectX-7 direct connect (200Gbps)

- `spark-vllm-docker` (eugr’s repo) with the `nemotron-3-super-nvfp4` recipe

- vLLM 0.18.1rc1 from the prebuilt wheels

**What Was Broken**

Building `vllm-node` from the latest main branch and running any recipe gives:

```
ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib
```

I saw eugr had been trying a few things (cuda 13.2 torch, revert, etc.) and figured I’d dig in to see if I could help since I really wanted the `--moe-backend cutlass` support for Nemotron Super’s LatentMoE architecture.

**Root Cause**

Demangled the symbol: `c10::MessageLogger::MessageLogger(char const*, int, int, bool)`. That’s a PyTorch core library constructor.
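If you want to reproduce the demangling yourself, `c++filt` (part of binutils) does it in one line:

```shell
# Demangle the undefined symbol from the ImportError
echo '_ZN3c1013MessageLoggerC1EPKciib' | c++filt
# c10::MessageLogger::MessageLogger(char const*, int, int, bool)
```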

The prebuilt vLLM wheel filename tells you everything: `vllm-0.18.1rc1.dev121+gcd7643015.d20260325.cu132`. Note the `cu132` suffix.

But the Dockerfile installs PyTorch from:

```
--index-url /whl/nightly/cu130
```

cu132 wheel + cu130 PyTorch = different `libc10.so` ABI = symbol not found.
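A quick way to catch this class of mismatch before building is to compare the CUDA tag baked into the wheel filename against the tag in the torch index URL. A minimal sketch, using the strings from this post (the full wheel filename is illustrative):

```shell
# Extract the cuNNN tag from each side and compare
wheel="vllm-0.18.1rc1.dev121+gcd7643015.d20260325.cu132-cp312-cp312-linux_aarch64.whl"
torch_index="/whl/nightly/cu130"

wheel_cu=$(echo "$wheel" | grep -o 'cu1[0-9][0-9]' | head -1)
index_cu=$(echo "$torch_index" | grep -o 'cu1[0-9][0-9]')

if [ "$wheel_cu" != "$index_cu" ]; then
  echo "MISMATCH: vLLM wheel is $wheel_cu but torch index is $index_cu"
fi
```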

**The Fix**

Change `cu130` → `cu132` on lines 48 and 259 of the Dockerfile. The one catch is that `torchvision` and `torchaudio` don’t publish cu132 aarch64 nightlies, so you have to split the install:

```shell
# torch from cu132 (must match the prebuilt vLLM wheel)
uv pip install --prerelease=allow torch --index-url /whl/nightly/cu132 && \
# torchvision/torchaudio: try cu132, fall back to cu130
uv pip install --prerelease=allow torchvision torchaudio triton \
    --index-url /whl/nightly/cu132 \
    --extra-index-url /whl/nightly/cu130
```

PR: eugr/spark-vllm-docker#141 ("fix: cu130 → cu132 PyTorch index to match prebuilt vLLM wheel ABI")

**Results**

| | Ollama (1 Spark) | vLLM NVFP4 TP=2 (2 Sparks) |
|---|---|---|
| Quantization | Q4_K_M GGUF | NVFP4 (modelopt_mixed) |
| Generation | 18 tok/s | **24 tok/s** |
| Context | 256K | 262K (1M native) |
| Tool calling | Ollama API | OpenAI API + `--enable-auto-tool-choice` |

The NVFP4 quality is noticeably better than Q4_K_M too — getting cleaner code output with proper docstrings and fewer hallucinations.

**Workflow That Helped**

For anyone with 2 Sparks: I downloaded the HuggingFace NVFP4 weights (~75 GB) on spark1 only, then copied them to spark2 over the CX7 link. rsync moved the 75 GB in about 2 minutes, versus 90+ minutes to download from HuggingFace on each node. Way better than downloading on both in parallel.
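A sketch of that copy step. The paths and the `spark2-cx7` hostname (assumed to resolve to spark2's ConnectX-7 interface) are hypothetical, so adjust for your setup:

```shell
# Copy the downloaded weights from spark1 to spark2 over the CX7 link.
# -a preserves the directory tree, -P shows progress and keeps partial
# transfers so an interrupted copy can resume.
rsync -aP ~/models/nemotron-3-super-nvfp4/ \
    spark2-cx7:~/models/nemotron-3-super-nvfp4/
```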

Hope this helps someone else get unstuck. Happy to answer questions about the setup.

A note on how I evaluated accuracy and tok/s: this isn't a strict perf test, just local testing with requests I've been building up and using for evaluation at the moment.

I am still new to managing a lot of the internals of LLMs on Sparks, especially across multiple machines.

This forum has been a great source of info so I wanted to share back something I did to help fix a problem.