System Information
Product: NVIDIA DGX Spark (2 units)
Software Configuration:
- OS: Ubuntu 24.04 (DGX OS 7.2.3)
- Kernel: 6.11.0-1016-nvidia (aarch64)
- GPU Driver: NVIDIA 580.95.05 (nvidia-580-open)
- CUDA Toolkit: 13.0.2
- NCCL: 2.27.7
- MLNX_OFED: 24.07-0.6.1.0
- Architecture: ARM64 (aarch64)
Hardware Configuration:
- 2× DGX Spark systems with GB10 Grace Blackwell Superchips
- Connected via 200 Gbps InfiniBand (Mellanox ConnectX-7)
- Network: 10.0.0.1 and 10.0.0.2
- Static IP configuration on the enp1s0f0np0 interface
Problem Description
Issue: GPU Direct RDMA is not functional between our two DGX Spark systems, forcing distributed training to use a slow CPU bounce buffer path instead of direct GPU-to-GPU transfers over InfiniBand.
Impact: Distributed training performance is degraded by 3-5x due to:
- Data path: GPU → CPU RAM → InfiniBand → CPU RAM → GPU (instead of direct GPU → IB → GPU)
- Bandwidth limited to ~25-30 Gbps instead of the full ~200 Gbps link capacity
- High CPU overhead during training
- Underutilized $32,000+ hardware investment
Current Status:
- ✅ MLNX_OFED 24.07 installed and working correctly
- ✅ InfiniBand link active (200 Gbps)
- ✅ RDMA communication working
- ✅ NCCL using NET/IB transport
- ✅ Distributed training completes (but slowly, via the CPU path)
- ❌ GPU Direct RDMA explicitly disabled by NCCL
- ❌ nvidia-peermem kernel module fails to load
Technical Details
Root Cause Identified
The nvidia-peermem kernel module, located at:
/lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
fails to load with an “Invalid argument” error:
$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
System logs show:
$ sudo dmesg | tail
[timestamp] nvidia_peermem: Unknown symbol ib_register_peer_memory_client
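To help triage, one check we can run is whether the running kernel exports the peer-memory API at all (a diagnostic sketch based on our understanding of the MLNX_OFED ib_core; we can attach real output on request):
$ sudo grep -E 'ib_(un)?register_peer_memory_client' /proc/kallsyms
# On a host whose ib_core exports the peer-memory API, both
# ib_register_peer_memory_client and ib_unregister_peer_memory_client
# should appear; if they are absent, the loaded ib_core does not
# provide the interface nvidia-peermem needs.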
Analysis
Based on NVIDIA’s own documentation and community reports, this occurs when:
- The NVIDIA GPU driver is installed before MLNX_OFED
- The nvidia-peermem module is compiled without the MLNX_OFED peer memory symbols
- Later installation of MLNX_OFED does not trigger recompilation of nvidia-peermem
- The module lacks the required ib_register_peer_memory_client and ib_unregister_peer_memory_client symbols
From NVIDIA’s documentation (GPUDirect RDMA guide):
“If the NVIDIA GPU driver is installed before MOFED, the GPU driver must be uninstalled and installed again to make sure nvidia-peermem is compiled with the RDMA APIs that are provided by MLNX_OFED.”
NCCL Diagnostic Output
$ NCCL_DEBUG=INFO [training script]
...
NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB enp1s0f0np0:10.0.0.1<0>
NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' (CUPTI metric failed)
...
The “GPU Direct RDMA Disabled” message confirms NCCL cannot use GPU Direct.
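If it helps, we can reproduce with more verbose NCCL logging. NCCL_DEBUG_SUBSYS and NCCL_NET_GDR_LEVEL are documented NCCL environment variables; the training command remains a placeholder:
$ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET NCCL_NET_GDR_LEVEL=SYS [training script]
# NCCL_NET_GDR_LEVEL=SYS asks NCCL to attempt GPU Direct RDMA at any
# topology distance; with nvidia-peermem unloadable we expect the same
# "GPU Direct RDMA Disabled" line, isolating the module as the blocker.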
Missing System Components
$ ls /sys/kernel/mm/memory_peers/
ls: cannot access '/sys/kernel/mm/memory_peers/': No such file or directory
This directory should exist when peer memory support is active.
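For comparison, on a host where peer memory is registered we would expect something like the following (a sketch based on public GPUDirect RDMA documentation; the exact client name, nv_mem here, may differ):
$ lsmod | grep peermem
nvidia_peermem         16384  0
$ ls /sys/kernel/mm/memory_peers/
nv_mem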
What We’ve Investigated
Verification That Other Components Work
- MLNX_OFED is correctly installed:
  $ ofed_info -s
  MLNX_OFED_LINUX-24.07-0.6.1.0
- InfiniBand link is active:
  $ ibstat
  CA 'mlx5_0'
    Port 1:
      State: Active
      Physical state: LinkUp
      Rate: 200
- RDMA devices are available:
  $ ibv_devinfo | grep state
  state: PORT_ACTIVE (4)
- Network connectivity works:
  $ ping 10.0.0.2
  PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
  64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.156 ms
- Distributed training completes successfully (but slowly):
  - Training runs to completion
  - NCCL uses IB transport (not falling back to TCP)
  - Performance is degraded due to the CPU bounce buffer path (see the benchmark sketch after this list)
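The benchmark referenced above: to quantify the bounce-buffer penalty we plan to compare host-memory and GPU-memory RDMA bandwidth with perftest (a sketch assuming a CUDA-enabled perftest build is available on DGX OS):
# Host-memory baseline:
$ ib_write_bw -d mlx5_0 --report_gbits                # server, on 10.0.0.1
$ ib_write_bw -d mlx5_0 --report_gbits 10.0.0.1       # client, on 10.0.0.2
# GPU-memory path; until nvidia-peermem loads we expect this to fail
# or fall back rather than reach link rate:
$ ib_write_bw -d mlx5_0 --report_gbits --use_cuda=0 10.0.0.1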
Research Conducted
We’ve found this is a known issue affecting multiple platforms:
- Grace Hopper (GH200) systems on Rocky Linux
- Various Ubuntu versions (16.04 through 22.04)
- Both x86_64 and ARM64 architectures
- Multiple kernel versions (4.4 through 6.11)
Reference Issues:
- NVIDIA Forums: “modprobe nvidia-peermem silently fails with EINVAL” (April 2024)
- GitHub Mellanox/nv_peer_memory: Issues #28, #116
- Rocky Linux Forums: “Mismatched Version of Kernel Symbol” (July 2024)
Why This Is Surprising for DGX Spark
While this is a known issue on DIY systems, we did not expect this on DGX Spark because:
- DGX systems ship with pre-validated software stacks
- DGX OS 7.2.3 is a custom Ubuntu image built specifically for DGX Spark
- These are $16,000+ enterprise systems, not DIY builds
- The product launched October 15, 2025; it is brand new and should have been tested
- GPU Direct RDMA is a core advertised feature for multi-system clusters
The ConnectX-7 networking and dual-system clustering are prominently featured in DGX Spark marketing materials, implying GPU Direct RDMA functionality.
What We’ve Tried
Attempted Solution 1: Manual Module Loading
$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
Result: Failed with “Invalid argument”
Attempted Solution 2: Checking Module Dependencies
$ modinfo /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
[Module exists but lacks proper MLNX_OFED symbols]
Result: The module exists but was compiled without the MLNX_OFED peer memory API (see the verification sketch below)
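To verify this directly, the symbol versions the module was built against can be dumped (a diagnostic sketch; we can attach the actual output):
$ modprobe --dump-modversions \
    /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko \
    | grep -i peer_memory
# This lists each imported symbol with the CRC it was compiled against;
# comparing the ib_register_peer_memory_client CRC with the one exported
# by the MLNX_OFED ib_core (Module.symvers) would confirm a mismatch.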
Attempted Solution 3: Researching Alternative (nv_peer_mem)
Based on research, the standard solution is to use nv_peer_mem from MLNX_OFED instead of nvidia-peermem from the GPU driver. However:
$ apt-cache search nvidia-peer-memory
[No results on ARM64 DGX OS 7.2.3]
The nvidia-peer-memory-dkms package appears to not be available in the DGX Spark software repositories.
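If NVIDIA confirms it is safe on DGX OS, the community fallback would be building the module from the Mellanox source tree cited above (a sketch only; the repository is deprecated in favor of nvidia-peermem, and we would strongly prefer an official package):
$ git clone https://github.com/Mellanox/nv_peer_memory.git
$ cd nv_peer_memory
$ make                       # assumes MLNX_OFED and driver sources are present
$ sudo insmod nv_peer_mem.ko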
What We’re Requesting
Immediate Need
1. Official workaround or fix to enable GPU Direct RDMA on DGX Spark systems running DGX OS 7.2.3
Possible solutions:
- Provide an nvidia-peer-memory-dkms package for ARM64/Ubuntu 24.04
- Provide instructions for reinstalling the GPU driver after MLNX_OFED (a sketch of what we expect this involves follows this list)
- Provide an alternative peer memory module
- Issue a software update that fixes the installation order
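For the reinstall option, we imagine something along these lines, following the GPUDirect RDMA guide quoted earlier; the package name below is our guess for DGX OS 7.2.3 and is exactly what we would like confirmed:
# Hypothetical package name; the actual DGX OS driver metapackage may differ.
$ sudo apt-get install --reinstall nvidia-dkms-580-open
$ dkms status                # confirm the 580.95.05 modules rebuilt against OFED headers
$ sudo modprobe nvidia-peermem
$ ls /sys/kernel/mm/memory_peers/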
2. Validation that our diagnosis is correct and this is the issue
3. Timeline for official fix if not immediately available
Long-Term Request
4. Update DGX Spark software image to prevent this issue on future shipments by:
- Installing MLNX_OFED before the GPU driver, or
- Including a properly compiled nvidia-peermem, or
- Including nv_peer_mem from MLNX_OFED in the base image
5. Documentation update to address this issue in DGX Spark setup guides
Business Impact
Current State:
- Two DGX Spark systems are operational but severely performance-limited
- Distributed training is 3-5x slower than it should be
- We cannot realize the value of the multi-system investment
- Development and research productivity are significantly hampered
Use Case: We are building AI training infrastructure for:
- Large language model fine-tuning (70B-200B parameters)
- Distributed training across both DGX Spark systems
- MSP service development leveraging local AI compute
- Customer-facing AI applications
The lack of GPU Direct RDMA fundamentally undermines the purpose of having two interconnected DGX Spark systems.
Additional Information Available
We can provide upon request:
- Complete dmesg output
- Full nvidia-bug-report.sh output
- NCCL debug logs
- Network topology diagrams
- Test scripts and benchmarks
- Any other diagnostic information needed
Preferred Contact Method
[INSERT YOUR PREFERRED CONTACT METHOD]
- Email: [INSERT EMAIL]
- Phone: [INSERT PHONE]
- Time Zone: US Central Time (Dallas, TX)
- Availability: [INSERT AVAILABILITY]
Urgency
Priority: High
Reason: Core functionality (GPU Direct RDMA for multi-system training) is not working on newly purchased enterprise hardware. This directly impacts our ability to use the systems for their intended purpose.
Timeline Need: We need a solution within 1-2 weeks to avoid significant project delays.
Summary
Two DGX Spark systems (serial numbers [INSERT]) cannot use GPU Direct RDMA for distributed training due to an nvidia-peermem module compilation issue. This appears to be caused by DGX OS 7.2.3 installing the GPU driver before MLNX_OFED, leaving nvidia-peermem compiled without the required peer memory symbols. We need an official fix or workaround to enable this core advertised feature.
Thank you for your assistance.
Jason Brashear
Titanium Computing