Enabling GPU Direct RDMA for DGX Spark Clustering

With DGX Spark's unified memory, you don't actually benefit from GPUDirect when writing your own code: the memory pointers for host and device are identical. So if the GPU writes a buffer and the CPU just reads that pointer directly, without a cudaMemcpy, you get maximum bandwidth. Over 200GbE that's already enough to essentially saturate the network connection: you can push data from GPU2 to GPU1 at >22 GB/s.
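To put that >22 GB/s figure in context, here is the back-of-the-envelope math (assuming 200 GbE line rate and 1 GB = 1e9 bytes; the numbers are from the benchmark results, the calculation is mine):

```python
# 200 GbE line rate in bytes per second: 200 Gbit/s / 8 bits per byte.
LINE_RATE_GBS = 200 / 8  # 25 GB/s

measured = 22.31  # GB/s, the best observed GPU-to-GPU transfer rate
utilization = measured / LINE_RATE_GBS

print(f"line rate:   {LINE_RATE_GBS:.1f} GB/s")
print(f"utilization: {utilization:.1%}")  # roughly 89% of line rate
```

Real links never hit 100% of line rate once protocol overhead is accounted for, so ~89% is close to the practical ceiling.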

If you're using legacy software that falls back to cudaMemcpy when it doesn't detect GPUDirect, it can be slower. Even though cudaMemcpy itself has more bandwidth than 200GbE, the extra staging copy adds latency to every transfer.
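A rough model of why the fallback hurts even though the copy itself is fast (this is an illustrative sketch; the 100 µs staging figure is a hypothetical number I chose, not a measurement):

```python
def effective_bandwidth(size_bytes, link_gbs, staging_us):
    """Effective GB/s when each transfer pays a fixed extra staging latency.

    size_bytes: message size in bytes
    link_gbs:   link bandwidth in GB/s (1 GB = 1e9 bytes)
    staging_us: extra per-transfer latency from the host-side copy, in µs
    """
    wire_time = size_bytes / (link_gbs * 1e9)  # seconds spent on the wire
    total = wire_time + staging_us * 1e-6      # plus the staging copy
    return size_bytes / total / 1e9

# Zero-copy path: no staging, the full link rate is available.
print(effective_bandwidth(8 * 2**20, 25, 0))    # 25.0 GB/s
# Fallback path: a hypothetical 100 µs staging copy per transfer.
print(effective_bandwidth(8 * 2**20, 25, 100))  # noticeably lower
```

The penalty shrinks as message size grows, since the fixed latency is amortized over more bytes, which is why the fallback hurts small transfers the most.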

I wrote a benchmark to demonstrate this. It only works for two nodes and has zero security features: it assumes that if you can SSH to a machine, you're allowed to copy and execute code there. On deployment, the host copies the benchmark to ~/nccl_benchmark.

NCCLBenchmark-DGXSpark-AlanBCDang.zip (35.5 MB)

```
┌─ GPU→GPU DIRECT (Rank 1 → Rank 0, unidirectional) ──────────┐
  Tests: GPU2 buffer → network → GPU1 buffer (no CPU copies)
  This is the path GPUDirect RDMA optimizes on discrete GPUs.
  On unified memory, we get equivalent performance without it.

  Size     Bandwidth     Latency      Eff.    OK
   8 MB    20.39 GB/s     411.5 µs    81.5%   ✓
  16 MB    21.45 GB/s     782.3 µs    85.8%   ✓
  32 MB    21.48 GB/s    1562.4 µs    85.9%   ✓
  64 MB    22.31 GB/s    3007.6 µs    89.3%   ✓
└──────────────────────────────────────────────────────────────┘
```
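The numbers in that table are self-consistent: efficiency is measured bandwidth as a fraction of the 25 GB/s line rate of 200 GbE, and latency is simply transfer size divided by bandwidth (this is my inference from the figures, not something the tool documents):

```python
rows = [
    # (size_mib, bandwidth_gbs, latency_us, efficiency_pct)
    (8,  20.39,  411.5, 81.5),
    (16, 21.45,  782.3, 85.8),
    (32, 21.48, 1562.4, 85.9),
    (64, 22.31, 3007.6, 89.3),
]

LINE_RATE = 25.0  # 200 GbE in GB/s (1 GB = 1e9 bytes)

for size_mib, bw, lat_us, eff in rows:
    size_bytes = size_mib * 2**20
    # Efficiency = fraction of the 200 GbE line rate actually achieved.
    assert abs(bw / LINE_RATE * 100 - eff) < 0.1
    # Latency = transfer size / bandwidth, in microseconds.
    assert abs(size_bytes / (bw * 1e9) * 1e6 - lat_us) < 1.0

print("table is self-consistent")
```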
