Nemotron-3-Super NVFP4 via vLLM TP=2 on 2x DGX Spark — 24 tok/s (ABI fix for cu130/cu132 mismatch)

EDIT: this is now resolved by a new prebuilt wheel (.dev176), so the original issue doesn't need any further changes; just clear the wheel cache so the new nightly is pulled.

Hello!

I’ve been trying to get Nemotron-3-Super running on a dual DGX Spark setup and wanted to share what I found, since I saw several other posts and comments about the same issue.

Note: I used Claude to help generate the rest of the message with the information that fixed running nemotron-3-super with TP=2 on my dual DGX spark stack.

**TL;DR**: The `_ZN3c1013MessageLoggerC1E` crash that’s been hitting people building with `spark-vllm-docker` is a cu130/cu132 mismatch in the Dockerfile. Two-line fix, PR submitted, and Nemotron Super NVFP4 is now serving at 24 tok/s via vLLM TP=2.

**The Setup**

- 2x DGX Spark GB10, ConnectX-7 direct connect (200Gbps)

- `spark-vllm-docker` (eugr’s repo) with the `nemotron-3-super-nvfp4` recipe

- vLLM 0.18.1rc1 from the prebuilt wheels

**What Was Broken**

Building `vllm-node` from the latest main branch and running any recipe gives:

```
ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib
```

I saw eugr had been trying a few things (cuda 13.2 torch, revert, etc.) and figured I’d dig in to see if I could help since I really wanted the `--moe-backend cutlass` support for Nemotron Super’s LatentMoE architecture.

**Root Cause**

Demangled the symbol: `c10::MessageLogger::MessageLogger(char const*, int, int, bool)`. That’s a PyTorch core library constructor.
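If you want to reproduce the demangling yourself, `c++filt` (part of binutils) does it in one line:

```shell
# Demangle the undefined symbol from the ImportError
echo '_ZN3c1013MessageLoggerC1EPKciib' | c++filt
# c10::MessageLogger::MessageLogger(char const*, int, int, bool)
```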

The prebuilt vLLM wheel filename tells you everything: `vllm-0.18.1rc1.dev121+gcd7643015.d20260325.cu132`. Note the `cu132` suffix.

But the Dockerfile installs PyTorch from:

```
--index-url /whl/nightly/cu130
```

cu132 wheel + cu130 PyTorch = different `libc10.so` ABI = symbol not found.
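A quick way to catch this class of mismatch before building is to compare the CUDA tag baked into the wheel filename against the tag in the torch index URL. A minimal sketch, using the strings from this post (the full wheel filename is illustrative):

```shell
# Extract the cuNNN tag from each side and compare
wheel="vllm-0.18.1rc1.dev121+gcd7643015.d20260325.cu132-cp312-cp312-linux_aarch64.whl"
torch_index="/whl/nightly/cu130"

wheel_cu=$(echo "$wheel" | grep -o 'cu1[0-9][0-9]' | head -1)
index_cu=$(echo "$torch_index" | grep -o 'cu1[0-9][0-9]')

if [ "$wheel_cu" != "$index_cu" ]; then
  echo "MISMATCH: vLLM wheel is $wheel_cu but torch index is $index_cu"
fi
```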

**The Fix**

Change `cu130` → `cu132` on lines 48 and 259 of the Dockerfile. The one catch is that `torchvision` and `torchaudio` don’t publish cu132 aarch64 nightlies, so you have to split the install:

```shell
# torch from cu132 (must match the prebuilt vLLM wheel)
uv pip install --prerelease=allow torch --index-url /whl/nightly/cu132 && \
# torchvision/torchaudio: try cu132, fall back to cu130
uv pip install --prerelease=allow torchvision torchaudio triton \
    --index-url /whl/nightly/cu132 \
    --extra-index-url /whl/nightly/cu130
```

PR: eugr/spark-vllm-docker#141 ("fix: cu130 → cu132 PyTorch index to match prebuilt vLLM wheel ABI")

**Results**

| | Ollama (1 Spark) | vLLM NVFP4 TP=2 (2 Sparks) |
|---|---|---|
| Quantization | Q4_K_M GGUF | NVFP4 (modelopt_mixed) |
| Generation | 18 tok/s | **24 tok/s** |
| Context | 256K | 262K (1M native) |
| Tool calling | Ollama API | OpenAI API + `--enable-auto-tool-choice` |

The NVFP4 quality is noticeably better than Q4_K_M too — getting cleaner code output with proper docstrings and fewer hallucinations.

**Workflow That Helped**

For anyone with 2 Sparks: I downloaded the HuggingFace NVFP4 weights (~75 GB) on spark1 only, then copied them to spark2 over the CX7 link. rsync moved the 75 GB in about 2 minutes, versus 90+ minutes to download from HuggingFace on each node. Way better than downloading on both in parallel.
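A sketch of that copy step. The paths and the `spark2-cx7` hostname (assumed to resolve to spark2's ConnectX-7 interface) are hypothetical, so adjust for your setup:

```shell
# Copy the downloaded weights from spark1 to spark2 over the CX7 link.
# -a preserves the directory tree, -P shows progress and keeps partial
# transfers so an interrupted copy can resume.
rsync -aP ~/models/nemotron-3-super-nvfp4/ \
    spark2-cx7:~/models/nemotron-3-super-nvfp4/
```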

Hope this helps someone else get unstuck. Happy to answer questions about the setup.

A note on how I evaluated accuracy and tok/s: this isn't a strict perf test, just local testing with requests I've been building up and using for evaluation at the moment.

I am still new to managing a lot of the internals of LLMs on Sparks, especially across multiple machines.

This forum has been a great source of info so I wanted to share back something I did to help fix a problem.