Enabling GPU Direct RDMA for DGX Spark Clustering

With DGX Spark's unified memory, you don't actually benefit from GPUDirect when writing your own code: the memory pointers for host and device are identical. So if the GPU writes a buffer and the CPU just reads that pointer directly, without a cudaMemcpy, you get maximum bandwidth. Over 200GbE that's already enough to essentially saturate the network connection: you can push data from GPU2 to GPU1 at >22 GB/s.
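To put that >22 GB/s figure in context, here is the back-of-the-envelope math (assuming 200 GbE line rate and 1 GB = 1e9 bytes; the numbers are from the benchmark results, the calculation is mine):

```python
# 200 GbE line rate in bytes per second: 200 Gbit/s / 8 bits per byte.
LINE_RATE_GBS = 200 / 8  # 25 GB/s

measured = 22.31  # GB/s, the best observed GPU-to-GPU transfer rate
utilization = measured / LINE_RATE_GBS

print(f"line rate:   {LINE_RATE_GBS:.1f} GB/s")
print(f"utilization: {utilization:.1%}")  # roughly 89% of line rate
```

Real links never hit 100% of line rate once protocol overhead is accounted for, so ~89% is close to the practical ceiling.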

If you're using legacy software that falls back to cudaMemcpy when it doesn't detect GPUDirect, it can be slower. Even though cudaMemcpy itself has more bandwidth than 200GbE, the extra staging copy adds latency to every transfer.
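A rough model of why the fallback hurts even though the copy itself is fast (this is an illustrative sketch; the 100 µs staging figure is a hypothetical number I chose, not a measurement):

```python
def effective_bandwidth(size_bytes, link_gbs, staging_us):
    """Effective GB/s when each transfer pays a fixed extra staging latency.

    size_bytes: message size in bytes
    link_gbs:   link bandwidth in GB/s (1 GB = 1e9 bytes)
    staging_us: extra per-transfer latency from the host-side copy, in µs
    """
    wire_time = size_bytes / (link_gbs * 1e9)  # seconds spent on the wire
    total = wire_time + staging_us * 1e-6      # plus the staging copy
    return size_bytes / total / 1e9

# Zero-copy path: no staging, the full link rate is available.
print(effective_bandwidth(8 * 2**20, 25, 0))    # 25.0 GB/s
# Fallback path: a hypothetical 100 µs staging copy per transfer.
print(effective_bandwidth(8 * 2**20, 25, 100))  # noticeably lower
```

The penalty shrinks as message size grows, since the fixed latency is amortized over more bytes, which is why the fallback hurts small transfers the most.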

I wrote a benchmark to demonstrate this. It only works for two nodes and has zero security features: it assumes that if you can SSH to a machine, you're allowed to copy and execute code there. On deployment, the host copies the benchmark to ~/nccl_benchmark.

NCCLBenchmark-DGXSpark-AlanBCDang.zip (35.5 MB)

```
┌─ GPU→GPU DIRECT (Rank 1 → Rank 0, unidirectional) ──────────┐
  Tests: GPU2 buffer → network → GPU1 buffer (no CPU copies)
  This is the path GPUDirect RDMA optimizes on discrete GPUs.
  On unified memory, we get equivalent performance without it.

  Size     Bandwidth     Latency      Eff.    OK
   8 MB    20.39 GB/s     411.5 µs    81.5%   ✓
  16 MB    21.45 GB/s     782.3 µs    85.8%   ✓
  32 MB    21.48 GB/s    1562.4 µs    85.9%   ✓
  64 MB    22.31 GB/s    3007.6 µs    89.3%   ✓
└──────────────────────────────────────────────────────────────┘
```
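The numbers in that table are self-consistent: efficiency is measured bandwidth as a fraction of the 25 GB/s line rate of 200 GbE, and latency is simply transfer size divided by bandwidth (this is my inference from the figures, not something the tool documents):

```python
rows = [
    # (size_mib, bandwidth_gbs, latency_us, efficiency_pct)
    (8,  20.39,  411.5, 81.5),
    (16, 21.45,  782.3, 85.8),
    (32, 21.48, 1562.4, 85.9),
    (64, 22.31, 3007.6, 89.3),
]

LINE_RATE = 25.0  # 200 GbE in GB/s (1 GB = 1e9 bytes)

for size_mib, bw, lat_us, eff in rows:
    size_bytes = size_mib * 2**20
    # Efficiency = fraction of the 200 GbE line rate actually achieved.
    assert abs(bw / LINE_RATE * 100 - eff) < 0.1
    # Latency = transfer size / bandwidth, in microseconds.
    assert abs(size_bytes / (bw * 1e9) * 1e6 - lat_us) < 1.0

print("table is self-consistent")
```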
