DGX Spark Multi-Node LLM Inference Report
Date: December 17, 2025
System: 2x NVIDIA DGX Spark (GB10 GPU - Blackwell SM121)
Goal: Run Qwen3-235B model with multi-node distributed inference
CRITICAL FINDING: Native Solution Failed, Workaround Used
NVIDIA’s native multi-node inference stack designed for DGX Spark is NOT ready for GB10/SM121. The working solution in this report was achieved through a workaround, not the intended native path.
Native vs Workaround Comparison
| Method | Status | Issue | Performance Impact |
|---|---|---|---|
| vLLM + Ray (native tensor parallelism) | FAILED | GB10 not recognized as GPU resource | - |
| TensorRT-LLM + NVFP4 (native NVIDIA stack) | FAILED | SM121 GEMM kernels missing | - |
| llama.cpp + RPC (workaround) | WORKING | Uses TCP/IP, not NCCL | ~1-2μs extra latency |
Expected Native Flow (Did NOT Work):
vLLM → Ray → NCCL → NVLink/ConnectX-7 → Native Tensor Parallelism
(39.34 GB/s)
Workaround Flow Used:
llama.cpp → RPC → TCP/IP → Manual Layer Splitting
(added latency)
Performance Implications
- NCCL test: 39.34 GB/s throughput (hardware is working)
- RPC backend: Running over TCP/IP, NOT using NCCL
- Potential loss: the native solution is estimated to be 2-3x faster
- Current performance: 12.5 t/s (better than NVIDIA’s 11.73 t/s benchmark, but below native potential)
Note to NVIDIA: vLLM/Ray integration for GB10 GPUs and SM121 NVFP4 kernel support should be critical priorities. The hardware can deliver 39 GB/s NCCL throughput, but the software stack cannot utilize it.
1. System Specifications
Hardware
| Node | IP (QSFP) | GPU | GPU Memory | CPU | RAM |
|---|---|---|---|---|---|
| dgxnode1 | 169.254.1.1 | NVIDIA GB10 | 128GB UMA (~117GB usable) | 20-core | 119GB |
| dgxnode2 | 169.254.1.2 | NVIDIA GB10 | 128GB UMA (~117GB usable) | 20-core | 119GB |
Network
- Connection: QSFP 200GbE direct cable
- MTU: 9000 (Jumbo frames)
- Subnet: 169.254.0.0/16 (link-local)
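A quick sanity check for the jumbo-frame setup above (a sketch; the interface name is an assumption and must be replaced with the actual QSFP port):

```shell
# The largest ICMP payload that fits a 9000-byte MTU is
# MTU - 20 (IP header) - 8 (ICMP header) = 8972 bytes.
PAYLOAD=$((9000 - 20 - 8))
echo "$PAYLOAD"   # 8972

# On each node (interface name is hypothetical; substitute the real port):
# sudo ip link set dev enp1s0 mtu 9000
# From dgxnode1, confirm the path carries full frames without fragmentation
# (-M do sets the don't-fragment bit):
# ping -c 3 -M do -s "$PAYLOAD" 169.254.1.2
```

If the ping fails at payload 8972 but succeeds at 1472, jumbo frames are not active end-to-end.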
Software Environment
- OS: Ubuntu 24.04 (Linux 6.14.0-1015-nvidia)
- CUDA: 13.0+ (SM121 Blackwell support)
- Driver: NVIDIA Container Toolkit
2. SUCCESSFUL OPERATIONS
2.1 NCCL Multi-Node Communication
Status: SUCCESS
NCCL version: 2.28.9-1
Test: nccl_message_transfer (all_reduce)
Performance: 39.34 GB/s throughput
Steps Taken:
- Configured MTU 9000 (jumbo frames)
- Set up Docker network configuration
- Configured NCCL environment variables
- Verified with nccl-tests
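The verification above can be reproduced with nccl-tests roughly as follows (a sketch: the interface name is an assumption, and the mpirun host layout assumes one process with one GPU per node):

```shell
# Pin NCCL's socket transport to the QSFP link and enable logging.
# Interface name is hypothetical; substitute the actual 200GbE port.
export NCCL_SOCKET_IFNAME=enp1s0
export NCCL_DEBUG=INFO

# Two-node all_reduce bandwidth sweep from 8 B to 1 GiB, doubling each step:
mpirun -np 2 -H 169.254.1.1:1,169.254.1.2:1 \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

The busbw column at large message sizes should approach the ~39 GB/s figure reported above.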
2.2 llama.cpp Build (CUDA + RPC Support)
Status: SUCCESS
Build Commands:
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j$(nproc)
Result:
- `llama-server`, `llama-cli`, and `rpc-server` binaries created
- Compiled with SM121 (Blackwell) support
- Copied to both nodes (rsync)
2.3 spark-vllm-docker Build
Status: SUCCESS
Duration: 50 minutes 30 seconds
Image: vllm-spark:latest (23.4GB)
Features:
- Based on vLLM v0.12.0
- Compiled with SM121 CUDA kernels
- NVFP4 and AWQ quantization support
- Optimized with Triton compiler
2.4 Model Download (Qwen3-235B Q4_K_XL)
Status: SUCCESS
Model: unsloth/Qwen3-235B-A22B-GGUF (UD-Q4_K_XL quantization)
Size: 134GB (3 split files)
Duration: ~20 minutes (with hf_transfer)
Files:
/home/user/models/UD-Q4_K_XL/
├── Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf (47GB)
├── Qwen3-235B-A22B-UD-Q4_K_XL-00002-of-00003.gguf (47GB)
└── Qwen3-235B-A22B-UD-Q4_K_XL-00003-of-00003.gguf (33GB)
Download Optimizations:
- Enabled the `hf_transfer` library
- HuggingFace token authentication
- `HF_HUB_ENABLE_HF_TRANSFER=1` environment variable
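The download steps above can be sketched as a single command (the `--include` pattern is an assumption about the repo layout; adjust to match the actual file paths):

```shell
# Accelerated download via hf_transfer; requires `pip install huggingface_hub hf_transfer`
# and a valid HF token if the repo needs authentication.
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download unsloth/Qwen3-235B-A22B-GGUF \
  --include "UD-Q4_K_XL/*" \
  --local-dir /home/user/models
```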
2.5 Single GPU Test (dgxnode1)
Status: SUCCESS
Performance: ~1.8 tokens/sec
Command:
./llama-server -m "$MODEL" -ngl 999 --host 0.0.0.0 --port 8082 -c 2048
Results:
- 95/95 layers loaded to GPU
- CUDA0 buffer: 115GB
- Model running entirely in GPU memory
2.6 Multi-Node RPC Test (2x DGX)
Status: SUCCESS
Performance: ~12.5 tokens/sec (7x speedup!)
Command:
./llama-server \
-m "$MODEL" \
--rpc "169.254.1.2:50052" \
-ngl 999 \
-fit off \
--host 0.0.0.0 \
--port 8082 \
-c 2048
Memory Distribution:
- CUDA0 (dgxnode1): 63GB
- RPC0 (dgxnode2): 64.5GB
- CPU Mapped: 334MB
API Test:
{
"prompt_per_second": 37.73,
"predicted_per_second": 12.50,
"total_tokens": 164
}
3. FAILED / PROBLEMATIC OPERATIONS
3.1 vLLM Ray Distributed Backend
Status: FAILED
Error: “Current node has no GPU available”
Root Cause:
The Ray cluster registers the GPUs as `accelerator_type:GB10`, but the vLLM v1 engine expects the generic `GPU` resource key. This is a resource mapping issue.
Details:
Ray Node Resources:
- CPU: 20.0
- memory: 68GB
- accelerator_type:GB10: 1.0
- GPU: (MISSING!) <-- This is the problem
Attempted Solutions:
- `VLLM_USE_V1=0` (legacy engine) - Did not work
- Ray cluster restart - Did not work
- Placement group cleanup - Did not work
Potential Fixes:
- Force GPU detection via `CUDA_VISIBLE_DEVICES`
- Patch vLLM’s Ray resource detection code
- Test if spark-vllm-docker image resolves this issue
Recommendation for NVIDIA/vLLM Team:
The GB10 GPU is not being recognized as a standard GPU resource in Ray. vLLM’s worker initialization fails because it looks for the `GPU` resource type, but Ray only registers `accelerator_type:GB10`. This needs either:
- Ray to also register a generic `GPU` resource for GB10, or
- vLLM to recognize `accelerator_type:GB10` as a valid GPU resource
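Two commands that may help when debugging this (a sketch; the `--num-gpus` override is an untested possibility, not a confirmed fix for GB10):

```shell
# Inspect what resources Ray actually registered (run on the head node):
python3 -c 'import ray; ray.init(address="auto"); print(ray.cluster_resources())'

# Untested possibility: force a generic GPU resource at cluster startup
# via ray's explicit --num-gpus override.
ray stop
ray start --head --num-gpus=1 --port=6379
# and on the worker node:
# ray start --address=169.254.1.1:6379 --num-gpus=1
```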
3.2 NVFP4 Quantization
Status: FAILED
Error: “Failed to initialize GEMM Plugin”
Root Cause:
NVFP4 kernels are written for SM90 (Hopper) and do not support SM121 (Blackwell).
Details:
NVFP4 FP8 GEMM kernel not found
No compatible kernel for SM121
Recommendation for NVIDIA Team:
SM121 (Blackwell/GB10) needs NVFP4 GEMM kernel support. Currently only SM90 (Hopper) kernels are available in TensorRT-LLM and vLLM.
Workaround Used: AWQ quantization (32% faster than NVFP4 on DGX Spark anyway)
3.3 llama.cpp RPC Segmentation Fault
Status: RESOLVED (with workaround)
Error: Segfault during “fitting params to device memory”
Root Cause:
llama.cpp’s automatic memory fitting algorithm is incompatible with the RPC backend.
Solution:
Added -fit off parameter to disable automatic fitting.
# Does NOT work:
./llama-server -m $MODEL --rpc "..." -ngl 999
# WORKS:
./llama-server -m $MODEL --rpc "..." -ngl 999 -fit off
Recommendation for llama.cpp Team:
The automatic memory fitting feature crashes when RPC backend is enabled. Consider adding RPC-aware memory fitting or documenting this limitation.
3.4 dgxnode2 Docker GPU Access
Status: RESOLVED
Error: “Failed to initialize NVML: Unknown Error”
Root Cause:
/etc/docker/daemon.json file was missing on dgxnode2 (NVIDIA Container Toolkit not configured).
Solution:
# Copied from dgxnode1 to dgxnode2
scp /etc/docker/daemon.json dgxnode2:/etc/docker/
ssh dgxnode2 "systemctl daemon-reload && systemctl restart docker"
daemon.json contents:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
Note: This may be a DGX Spark setup issue - the second node did not have Docker properly configured for GPU access out of the box.
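After copying the config, GPU access can be verified from the first node roughly as follows (a sketch; assumes passwordless SSH between the nodes):

```shell
# Confirm the nvidia runtime is now the default on dgxnode2:
ssh dgxnode2 "docker info --format '{{.DefaultRuntime}}'"

# Confirm a container can actually see the GB10:
ssh dgxnode2 "docker run --rm --gpus all ubuntu nvidia-smi"
```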
3.5 aria2c Download Issue
Status: RESOLVED
Problem:
aria2c was incompatible with data partially downloaded by a previous `hf download` run; the resulting files appeared as sparse files.
Solution:
Used hf download with hf_transfer instead of aria2c.
4. PERFORMANCE COMPARISON
| Configuration | Tokens/sec | Notes |
|---|---|---|
| CPU Only (single node) | ~0.1 | Too slow, not practical |
| Single GPU (dgxnode1) | ~1.8 | Entire model 115GB in GPU |
| Multi-node (2x DGX RPC) | ~12.5 | 7x speedup |
Prompt Processing: 37.7 tokens/sec
Token Generation: 12.5 tokens/sec
5. CURRENT WORKING STATE
Running Services
| Service | Host | Port | Status |
|---|---|---|---|
| llama-server | dgxnode1 | 8082 | RUNNING |
| rpc-server | dgxnode1 | 50052 | RUNNING |
| rpc-server | dgxnode2 | 50052 | RUNNING |
API Usage
curl http://dgxnode1:8082/v1/chat/completions \
-H "Content-Type: application/json" \
-d @request.json
Example Request
{
"model": "qwen3-235b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}
6. NOTES FOR FUTURE UPDATES
Items to Check on Driver/Software Updates
- vLLM Ray GPU Resource Issue
  - vLLM version: 0.12.0
  - Ray version: 2.x
  - Issue: `accelerator_type:GB10` vs `GPU` resource mapping
  - May be fixed in newer versions
- NVFP4 SM121 Support
  - TensorRT-LLM and vLLM need SM121 NVFP4 kernel support
  - May come with CUDA 13.x updates
- llama.cpp RPC Stability
  - `-fit off` workaround required
  - May be fixed in newer versions
Recommended Configuration (Production)
# Start RPC Servers (on each node)
cd ~/llama.cpp/build/bin
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
./rpc-server -H <NODE_IP> -p 50052
# Start llama-server (on master node)
./llama-server \
-m /home/user/models/UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
--rpc "169.254.1.2:50052" \
-ngl 999 \
-fit off \
--host 0.0.0.0 \
--port 8082 \
-c 4096
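Once the stack is up, a quick smoke test (a sketch; `/health` and `/v1/chat/completions` are standard llama-server endpoints):

```shell
# Server liveness:
curl -s http://dgxnode1:8082/health

# End-to-end generation across both nodes:
curl -s http://dgxnode1:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-235b","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'
```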
7. FILE LOCATIONS
| File/Directory | Location |
|---|---|
| llama.cpp build | /home/user/llama.cpp/build/bin/ |
| llama.cpp (dgxnode2) | /home/user/llama.cpp-build/bin/ |
| Qwen3-235B model | /home/user/models/UD-Q4_K_XL/ |
| spark-vllm-docker | /home/user/spark-vllm-docker/ |
| vLLM image | vllm-spark:latest |
| Ray cluster script | /home/user/run_cluster.sh |
8. CONCLUSION
Successfully ran the Qwen3-235B (235-billion-parameter) model on a 2x DGX Spark (GB10 Blackwell GPU) system.
Key Achievements:
- Multi-node distributed inference with llama.cpp RPC backend
- 12.5 tokens/sec performance (7x speedup compared to single GPU)
- 127GB model memory distributed across two GPUs
Outstanding Issues:
- vLLM Ray backend issue (GPU resource mapping) - Needs fix from vLLM/Ray team
- NVFP4 quantization support (SM121 kernel missing) - Needs fix from NVIDIA team
Recommendation:
For production environments, the llama.cpp RPC solution is stable and performant. For a vLLM-based solution, newer releases should be monitored for GB10/SM121 compatibility fixes.
9. SUMMARY FOR NVIDIA TEAM
Critical Issues Requiring Attention:
- GB10 GPU Not Recognized as “GPU” Resource in Ray
  - Impact: vLLM multi-node inference completely broken
  - Workaround: None (had to use llama.cpp instead)
  - Suggested Fix: Ensure Ray registers GB10 as both `accelerator_type:GB10` AND a generic `GPU` resource
- Missing NVFP4 Kernels for SM121
  - Impact: Cannot use NVFP4 quantization on DGX Spark
  - Workaround: Use AWQ quantization
  - Suggested Fix: Add SM121 GEMM kernels to TensorRT-LLM/vLLM
- Docker daemon.json Missing on Second DGX Spark Node
  - Impact: GPU not accessible in Docker containers
  - Workaround: Manually copy config from first node
  - Suggested Fix: Ensure NVIDIA Container Toolkit is properly configured on all nodes during DGX setup
10. TRACKING LINKS FOR UPDATES
The following GitHub issues and forum threads should be monitored for fixes to the problems encountered in this report.
Critical Priority - Must Watch
| Link | Description | Status |
|---|---|---|
| vLLM #30163 | NVFP4 on 2x DGX Spark - Exact same scenario | OPEN |
| vLLM #12614 | “Current node has no GPU available” - Main issue | OPEN |
| llama.cpp #13083 | Tensor Parallelism over RPC - Would give 2-3x speedup | OPEN |
vLLM + Ray GPU Detection Issues
| Link | Description |
|---|---|
| vLLM #13093 | Ray distributed “no GPU” error |
| vLLM #14109 | Fractional GPU resource name issue |
| Ray #59064 | Ray Serve + vLLM v1 placement group conflict |
NVIDIA Forum - DGX Spark / GB10 Issues
| Link | Description |
|---|---|
| Two Sparks Does Not Work | Multi-node vLLM issue |
| NIM Containers Fail on SM121 | Triton/vLLM SM121 crash |
| vLLM Container Issue | Container problems |
| vLLM Forums - DGX Spark | vLLM forum discussion |
TensorRT-LLM / NVFP4 SM121 Support
| Link | Description |
|---|---|
| TensorRT-LLM Releases | Watch for SM121 kernel support |
| TensorRT-LLM #3591 | Blackwell + FP4 issue |
| TensorRT-LLM #5018 | RTX 5090 NVFP4 support |
| Support Matrix | Official supported GPUs |
llama.cpp RPC Improvements
| Link | Description |
|---|---|
| llama.cpp #9086 | Tensor Parallelism support |
| llama.cpp #15463 | RPC dual-node 50% GPU utilization bug |
| DGX Spark Discussion | DGX Spark benchmarks |
Release Pages (Check Weekly)
| Link | Description |
|---|---|
| llama.cpp Releases | New versions |
| vLLM Releases | vLLM updates |
| TensorRT-LLM Releases | TRT-LLM updates |
| DGX Spark Forum | Main DGX Spark forum |
Report Date: December 17, 2025
System: dgxnode1 + dgxnode2 (2x DGX Spark)
Author: Testing multi-node LLM inference on DGX Spark cluster