DGX Spark Multi-Node LLM Inference Report
Date: December 17, 2025
System: 2x NVIDIA DGX Spark (GB10 GPU - Blackwell SM121)
Goal: Run Qwen3-235B model with multi-node distributed inference
CRITICAL FINDING: Native Solution Failed, Workaround Used
NVIDIA’s native multi-node inference stack designed for DGX Spark is NOT ready for GB10/SM121. The working solution in this report was achieved through a workaround, not the intended native path.
Native vs Workaround Comparison
| Method | Status | Issue | Performance Impact |
|---|---|---|---|
| vLLM + Ray (native tensor parallelism) | FAILED | GB10 not recognized as GPU resource | - |
| TensorRT-LLM + NVFP4 (native NVIDIA stack) | FAILED | SM121 GEMM kernels missing | - |
| llama.cpp + RPC (workaround) | WORKING | Uses TCP/IP, not NCCL | ~1-2μs extra latency |
Expected Native Flow (Did NOT Work):
vLLM → Ray → NCCL → NVLink/ConnectX-7 → Native Tensor Parallelism
(39.34 GB/s)
Workaround Flow Used:
llama.cpp → RPC → TCP/IP → Manual Layer Splitting
(added latency)
Performance Implications
- NCCL test: 39.34 GB/s throughput (hardware is working)
- RPC backend: Running over TCP/IP, NOT using NCCL
- Potential loss: the native solution is estimated to be 2-3x faster
- Current performance: 12.5 t/s (better than NVIDIA’s 11.73 t/s benchmark, but below native potential)
Note to NVIDIA: vLLM/Ray integration for GB10 GPUs and SM121 NVFP4 kernel support should be critical priorities. The hardware can deliver 39 GB/s NCCL throughput, but the software stack cannot utilize it.
1. System Specifications
Hardware
| Node | IP (QSFP) | GPU | GPU Memory | CPU | RAM |
|---|---|---|---|---|---|
| dgxnode1 | 169.254.1.1 | NVIDIA GB10 | 128GB UMA (~117GB usable) | 20-core | 119GB |
| dgxnode2 | 169.254.1.2 | NVIDIA GB10 | 128GB UMA (~117GB usable) | 20-core | 119GB |
Network
- Connection: QSFP 200GbE direct cable
- MTU: 9000 (Jumbo frames)
- Subnet: 169.254.0.0/16 (link-local)
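A quick sanity check for the jumbo-frame setup above (a sketch; the interface name is an assumption and must be replaced with the actual QSFP port):

```shell
# The largest ICMP payload that fits a 9000-byte MTU is
# MTU - 20 (IP header) - 8 (ICMP header) = 8972 bytes.
PAYLOAD=$((9000 - 20 - 8))
echo "$PAYLOAD"   # 8972

# On each node (interface name is hypothetical; substitute the real port):
# sudo ip link set dev enp1s0 mtu 9000
# From dgxnode1, confirm the path carries full frames without fragmentation
# (-M do sets the don't-fragment bit):
# ping -c 3 -M do -s "$PAYLOAD" 169.254.1.2
```

If the ping fails at payload 8972 but succeeds at 1472, jumbo frames are not active end-to-end.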
Software Environment
- OS: Ubuntu 24.04 (Linux 6.14.0-1015-nvidia)
- CUDA: 13.0+ (SM121 Blackwell support)
- Driver: NVIDIA Container Toolkit
2. SUCCESSFUL OPERATIONS
2.1 NCCL Multi-Node Communication
Status: SUCCESS
NCCL version: 2.28.9-1
Test: nccl_message_transfer (all_reduce)
Performance: 39.34 GB/s throughput
Steps Taken:
- Configured MTU 9000 (jumbo frames)
- Set up Docker network configuration
- Configured NCCL environment variables
- Verified with nccl-tests
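The verification above can be reproduced with nccl-tests roughly as follows (a sketch: the interface name is an assumption, and the mpirun host layout assumes one process with one GPU per node):

```shell
# Pin NCCL's socket transport to the QSFP link and enable logging.
# Interface name is hypothetical; substitute the actual 200GbE port.
export NCCL_SOCKET_IFNAME=enp1s0
export NCCL_DEBUG=INFO

# Two-node all_reduce bandwidth sweep from 8 B to 1 GiB, doubling each step:
mpirun -np 2 -H 169.254.1.1:1,169.254.1.2:1 \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

The busbw column at large message sizes should approach the ~39 GB/s figure reported above.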
2.2 llama.cpp Build (CUDA + RPC Support)
Status: SUCCESS
Build Commands:
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j$(nproc)
Result:
- `llama-server`, `llama-cli`, and `rpc-server` binaries created
- Compiled with SM121 (Blackwell) support
- Copied to both nodes (rsync)
2.3 spark-vllm-docker Build
Status: SUCCESS
Duration: 50 minutes 30 seconds
Image: vllm-spark:latest (23.4GB)
Features:
- Based on vLLM v0.12.0
- Compiled with SM121 CUDA kernels
- NVFP4 and AWQ quantization support
- Optimized with Triton compiler
2.4 Model Download (Qwen3-235B Q4_K_XL)
Status: SUCCESS
Model: unsloth/Qwen3-235B-A22B-GGUF (UD-Q4_K_XL quantization)
Size: 134GB (3 split files)
Duration: ~20 minutes (with hf_transfer)
Files:
/home/user/models/UD-Q4_K_XL/
├── Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf (47GB)
├── Qwen3-235B-A22B-UD-Q4_K_XL-00002-of-00003.gguf (47GB)
└── Qwen3-235B-A22B-UD-Q4_K_XL-00003-of-00003.gguf (33GB)
Download Optimizations:
- Enabled the `hf_transfer` library
- HuggingFace token authentication
- `HF_HUB_ENABLE_HF_TRANSFER=1` environment variable
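The download steps above can be sketched as a single command (the `--include` pattern is an assumption about the repo layout; adjust to match the actual file paths):

```shell
# Accelerated download via hf_transfer; requires `pip install huggingface_hub hf_transfer`
# and a valid HF token if the repo needs authentication.
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download unsloth/Qwen3-235B-A22B-GGUF \
  --include "UD-Q4_K_XL/*" \
  --local-dir /home/user/models
```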
2.5 Single GPU Test (dgxnode1)
Status: SUCCESS
Performance: ~1.8 tokens/sec
Command:
./llama-server -m "$MODEL" -ngl 999 --host 0.0.0.0 --port 8082 -c 2048
Results:
- 95/95 layers loaded to GPU
- CUDA0 buffer: 115GB
- Model running entirely in GPU memory
2.6 Multi-Node RPC Test (2x DGX)
Status: SUCCESS
Performance: ~12.5 tokens/sec (7x speedup!)
Command:
./llama-server \
-m "$MODEL" \
--rpc "169.254.1.2:50052" \
-ngl 999 \
-fit off \
--host 0.0.0.0 \
--port 8082 \
-c 2048
Memory Distribution:
- CUDA0 (dgxnode1): 63GB
- RPC0 (dgxnode2): 64.5GB
- CPU Mapped: 334MB
API Test:
{
"prompt_per_second": 37.73,
"predicted_per_second": 12.50,
"total_tokens": 164
}
3. FAILED / PROBLEMATIC OPERATIONS
3.1 vLLM Ray Distributed Backend
Status: FAILED
Error: “Current node has no GPU available”
Root Cause:
The Ray cluster registers the GPUs as `accelerator_type:GB10`, but the vLLM v1 engine expects the generic `GPU` resource key. This is a resource mapping issue.
Details:
Ray Node Resources:
- CPU: 20.0
- memory: 68GB
- accelerator_type:GB10: 1.0
- GPU: (MISSING!) <-- This is the problem
Attempted Solutions:
- `VLLM_USE_V1=0` (legacy engine) - Did not work
- Ray cluster restart - Did not work
- Placement group cleanup - Did not work
Potential Fixes:
- Force GPU detection via `CUDA_VISIBLE_DEVICES`
- Patch vLLM’s Ray resource detection code
- Test if spark-vllm-docker image resolves this issue
Recommendation for NVIDIA/vLLM Team:
The GB10 GPU is not being recognized as a standard GPU resource in Ray. vLLM’s worker initialization fails because it looks for the `GPU` resource type, but Ray only registers `accelerator_type:GB10`. This needs either:
- Ray to also register a generic `GPU` resource for GB10, or
- vLLM to recognize `accelerator_type:GB10` as a valid GPU resource
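Two commands that may help when debugging this (a sketch; the `--num-gpus` override is an untested possibility, not a confirmed fix for GB10):

```shell
# Inspect what resources Ray actually registered (run on the head node):
python3 -c 'import ray; ray.init(address="auto"); print(ray.cluster_resources())'

# Untested possibility: force a generic GPU resource at cluster startup
# via ray's explicit --num-gpus override.
ray stop
ray start --head --num-gpus=1 --port=6379
# and on the worker node:
# ray start --address=169.254.1.1:6379 --num-gpus=1
```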
3.2 NVFP4 Quantization
Status: FAILED
Error: “Failed to initialize GEMM Plugin”
Root Cause:
NVFP4 kernels are written for SM90 (Hopper) and do not support SM121 (Blackwell).
Details:
NVFP4 FP8 GEMM kernel not found
No compatible kernel for SM121
Recommendation for NVIDIA Team:
SM121 (Blackwell/GB10) needs NVFP4 GEMM kernel support. Currently only SM90 (Hopper) kernels are available in TensorRT-LLM and vLLM.
Workaround Used: AWQ quantization (32% faster than NVFP4 on DGX Spark anyway)
3.3 llama.cpp RPC Segmentation Fault
Status: RESOLVED (with workaround)
Error: Segfault during “fitting params to device memory”
Root Cause:
llama.cpp’s automatic memory fitting algorithm is incompatible with the RPC backend.
Solution:
Added -fit off parameter to disable automatic fitting.
# Does NOT work:
./llama-server -m $MODEL --rpc "..." -ngl 999
# WORKS:
./llama-server -m $MODEL --rpc "..." -ngl 999 -fit off
Recommendation for llama.cpp Team:
The automatic memory fitting feature crashes when RPC backend is enabled. Consider adding RPC-aware memory fitting or documenting this limitation.
3.4 dgxnode2 Docker GPU Access
Status: RESOLVED
Error: “Failed to initialize NVML: Unknown Error”
Root Cause:
/etc/docker/daemon.json file was missing on dgxnode2 (NVIDIA Container Toolkit not configured).
Solution:
# Copied from dgxnode1 to dgxnode2
scp /etc/docker/daemon.json dgxnode2:/etc/docker/
ssh dgxnode2 "systemctl daemon-reload && systemctl restart docker"
daemon.json contents:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
Note: This may be a DGX Spark setup issue - the second node did not have Docker properly configured for GPU access out of the box.
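After copying the config, GPU access can be verified from the first node roughly as follows (a sketch; assumes passwordless SSH between the nodes):

```shell
# Confirm the nvidia runtime is now the default on dgxnode2:
ssh dgxnode2 "docker info --format '{{.DefaultRuntime}}'"

# Confirm a container can actually see the GB10:
ssh dgxnode2 "docker run --rm --gpus all ubuntu nvidia-smi"
```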
3.5 aria2c Download Issue
Status: RESOLVED
Problem:
aria2c was incompatible with data partially downloaded by a previous `hf download` run; the resulting files appeared as sparse files.
Solution:
Used hf download with hf_transfer instead of aria2c.
4. PERFORMANCE COMPARISON
| Configuration | Tokens/sec | Notes |
|---|---|---|
| CPU Only (single node) | ~0.1 | Too slow, not practical |
| Single GPU (dgxnode1) | ~1.8 | Entire model 115GB in GPU |
| Multi-node (2x DGX RPC) | ~12.5 | 7x speedup |
Prompt Processing: 37.7 tokens/sec
Token Generation: 12.5 tokens/sec
5. CURRENT WORKING STATE
Running Services
| Service | Host | Port | Status |
|---|---|---|---|
| llama-server | dgxnode1 | 8082 | RUNNING |
| rpc-server | dgxnode1 | 50052 | RUNNING |
| rpc-server | dgxnode2 | 50052 | RUNNING |
API Usage
curl http://dgxnode1:8082/v1/chat/completions \
-H "Content-Type: application/json" \
-d @request.json
Example Request
{
"model": "qwen3-235b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}
6. NOTES FOR FUTURE UPDATES
Items to Check on Driver/Software Updates
- vLLM Ray GPU Resource Issue
  - vLLM version: 0.12.0
  - Ray version: 2.x
  - Issue: `accelerator_type:GB10` vs `GPU` resource mapping
  - May be fixed in newer versions
- NVFP4 SM121 Support
  - TensorRT-LLM and vLLM need SM121 NVFP4 kernel support
  - May come with CUDA 13.x updates
- llama.cpp RPC Stability
  - `-fit off` workaround required
  - May be fixed in newer versions
Recommended Configuration (Production)
# Start RPC Servers (on each node)
cd ~/llama.cpp/build/bin
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
./rpc-server -H <NODE_IP> -p 50052
# Start llama-server (on master node)
./llama-server \
-m /home/user/models/UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
--rpc "169.254.1.2:50052" \
-ngl 999 \
-fit off \
--host 0.0.0.0 \
--port 8082 \
-c 4096
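Once the stack is up, a quick smoke test (a sketch; `/health` and `/v1/chat/completions` are standard llama-server endpoints):

```shell
# Server liveness:
curl -s http://dgxnode1:8082/health

# End-to-end generation across both nodes:
curl -s http://dgxnode1:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-235b","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'
```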
7. FILE LOCATIONS
| File/Directory | Location |
|---|---|
| llama.cpp build | /home/user/llama.cpp/build/bin/ |
| llama.cpp (dgxnode2) | /home/user/llama.cpp-build/bin/ |
| Qwen3-235B model | /home/user/models/UD-Q4_K_XL/ |
| spark-vllm-docker | /home/user/spark-vllm-docker/ |
| vLLM image | vllm-spark:latest |
| Ray cluster script | /home/user/run_cluster.sh |
8. CONCLUSION
Successfully ran the Qwen3-235B (235-billion-parameter) model on a 2x DGX Spark (GB10 Blackwell GPU) system.
Key Achievements:
- Multi-node distributed inference with llama.cpp RPC backend
- 12.5 tokens/sec performance (7x speedup compared to single GPU)
- 127GB model memory distributed across two GPUs
Outstanding Issues:
- vLLM Ray backend issue (GPU resource mapping) - Needs fix from vLLM/Ray team
- NVFP4 quantization support (SM121 kernel missing) - Needs fix from NVIDIA team
Recommendation:
For production environments, the llama.cpp RPC solution is stable and performant. For a vLLM-based solution, newer releases should be monitored for GB10/SM121 compatibility fixes.
9. SUMMARY FOR NVIDIA TEAM
Critical Issues Requiring Attention:
- GB10 GPU Not Recognized as “GPU” Resource in Ray
  - Impact: vLLM multi-node inference completely broken
  - Workaround: None (had to use llama.cpp instead)
  - Suggested Fix: Ensure Ray registers GB10 as both `accelerator_type:GB10` AND a generic `GPU` resource
- Missing NVFP4 Kernels for SM121
  - Impact: Cannot use NVFP4 quantization on DGX Spark
  - Workaround: Use AWQ quantization
  - Suggested Fix: Add SM121 GEMM kernels to TensorRT-LLM/vLLM
- Docker daemon.json Missing on Second DGX Spark Node
  - Impact: GPU not accessible in Docker containers
  - Workaround: Manually copy config from first node
  - Suggested Fix: Ensure NVIDIA Container Toolkit is properly configured on all nodes during DGX setup
10. TRACKING LINKS FOR UPDATES
The following GitHub issues and forum threads should be monitored for fixes to the problems encountered in this report.
Critical Priority - Must Watch
| Link | Description | Status |
|---|---|---|
| vLLM #30163 | NVFP4 on 2x DGX Spark - Exact same scenario | OPEN |
| vLLM #12614 | “Current node has no GPU available” - Main issue | OPEN |
| llama.cpp #13083 | Tensor Parallelism over RPC - Would give 2-3x speedup | OPEN |
vLLM + Ray GPU Detection Issues
| Link | Description |
|---|---|
| vLLM #13093 | Ray distributed “no GPU” error |
| vLLM #14109 | Fractional GPU resource name issue |
| Ray #59064 | Ray Serve + vLLM v1 placement group conflict |
NVIDIA Forum - DGX Spark / GB10 Issues
| Link | Description |
|---|---|
| Two Sparks Does Not Work | Multi-node vLLM issue |
| NIM Containers Fail on SM121 | Triton/vLLM SM121 crash |
| vLLM Container Issue | Container problems |
| vLLM Forums - DGX Spark | vLLM forum discussion |
TensorRT-LLM / NVFP4 SM121 Support
| Link | Description |
|---|---|
| TensorRT-LLM Releases | Watch for SM121 kernel support |
| TensorRT-LLM #3591 | Blackwell + FP4 issue |
| TensorRT-LLM #5018 | RTX 5090 NVFP4 support |
| Support Matrix | Official supported GPUs |
llama.cpp RPC Improvements
| Link | Description |
|---|---|
| llama.cpp #9086 | Tensor Parallelism support |
| llama.cpp #15463 | RPC dual-node 50% GPU utilization bug |
| DGX Spark Discussion | DGX Spark benchmarks |
Release Pages (Check Weekly)
| Link | Description |
|---|---|
| llama.cpp Releases | New versions |
| vLLM Releases | vLLM updates |
| TensorRT-LLM Releases | TRT-LLM updates |
| DGX Spark Forum | Main DGX Spark forum |
Report Date: December 17, 2025
System: dgxnode1 + dgxnode2 (2x DGX Spark)
Author: Testing multi-node LLM inference on DGX Spark cluster