System Information
Product: NVIDIA DGX Spark (2 units)
Software Configuration:
- OS: Ubuntu 24.04 (DGX OS 7.2.3)
- Kernel: 6.11.0-1016-nvidia (aarch64)
- GPU Driver: NVIDIA 580.95.05 (nvidia-580-open)
- CUDA Toolkit: 13.0.2
- NCCL: 2.27.7
- MLNX_OFED: 24.07-0.6.1.0
- Architecture: ARM64 (aarch64)
Hardware Configuration:
- 2× DGX Spark systems with GB10 Grace Blackwell Superchips
- Connected via 200 Gbps InfiniBand (Mellanox ConnectX-7)
- Network: 10.0.0.1 and 10.0.0.2
- Static IP configuration on the enp1s0f0np0 interface
Problem Description
Issue: GPU Direct RDMA is not functional between our two DGX Spark systems, forcing distributed training to use a slow CPU bounce buffer path instead of direct GPU-to-GPU transfers over InfiniBand.
Impact: Distributed training performance is degraded by 3-5x due to:
- Data path: GPU → CPU RAM → InfiniBand → CPU RAM → GPU (instead of direct GPU → IB → GPU)
- Bandwidth limited to ~25-30 Gbps instead of the full ~200 Gbps link capacity
- High CPU overhead during training
- Underutilized $32,000+ hardware investment
Current Status:
- ✅ MLNX_OFED 24.07 installed and working correctly
- ✅ InfiniBand link active (200 Gbps)
- ✅ RDMA communication working
- ✅ NCCL using NET/IB transport
- ✅ Distributed training completes (but slowly, via the CPU path)
- ❌ GPU Direct RDMA explicitly disabled by NCCL
- ❌ nvidia-peermem kernel module fails to load
Technical Details
Root Cause Identified
The nvidia-peermem kernel module, located at:
/lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
fails to load with an “Invalid argument” error:
$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
System logs show:
$ sudo dmesg | tail
[timestamp] nvidia_peermem: Unknown symbol ib_register_peer_memory_client
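To help triage, one check we can run is whether the running kernel exports the peer-memory API at all (a diagnostic sketch based on our understanding of the MLNX_OFED ib_core; we can attach real output on request):
$ sudo grep -E 'ib_(un)?register_peer_memory_client' /proc/kallsyms
# On a host whose ib_core exports the peer-memory API, both
# ib_register_peer_memory_client and ib_unregister_peer_memory_client
# should appear; if they are absent, the loaded ib_core does not
# provide the interface nvidia-peermem needs.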
Analysis
Based on NVIDIA’s own documentation and community reports, this occurs when:
- The NVIDIA GPU driver is installed before MLNX_OFED
- The nvidia-peermem module is compiled without the MLNX_OFED peer memory symbols
- Later installation of MLNX_OFED does not trigger recompilation of nvidia-peermem
- The module lacks the required ib_register_peer_memory_client and ib_unregister_peer_memory_client symbols
From NVIDIA’s documentation (GPUDirect RDMA guide):
“If the NVIDIA GPU driver is installed before MOFED, the GPU driver must be uninstalled and installed again to make sure nvidia-peermem is compiled with the RDMA APIs that are provided by MLNX_OFED.”
NCCL Diagnostic Output
$ NCCL_DEBUG=INFO [training script]
...
NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB enp1s0f0np0:10.0.0.1<0>
NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' (CUPTI metric failed)
...
The “GPU Direct RDMA Disabled” message confirms NCCL cannot use GPU Direct.
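If it helps, we can reproduce with more verbose NCCL logging. NCCL_DEBUG_SUBSYS and NCCL_NET_GDR_LEVEL are documented NCCL environment variables; the training command remains a placeholder:
$ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET NCCL_NET_GDR_LEVEL=SYS [training script]
# NCCL_NET_GDR_LEVEL=SYS asks NCCL to attempt GPU Direct RDMA at any
# topology distance; with nvidia-peermem unloadable we expect the same
# "GPU Direct RDMA Disabled" line, isolating the module as the blocker.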
Missing System Components
$ ls /sys/kernel/mm/memory_peers/
ls: cannot access '/sys/kernel/mm/memory_peers/': No such file or directory
This directory should exist when peer memory support is active.
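For comparison, on a host where peer memory is registered we would expect something like the following (a sketch based on public GPUDirect RDMA documentation; the exact client name, nv_mem here, may differ):
$ lsmod | grep peermem
nvidia_peermem         16384  0
$ ls /sys/kernel/mm/memory_peers/
nv_mem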
What We’ve Investigated
Verification That Other Components Work
- MLNX_OFED is correctly installed:
  $ ofed_info -s
  MLNX_OFED_LINUX-24.07-0.6.1.0
- InfiniBand link is active:
  $ ibstat
  CA 'mlx5_0'
    Port 1:
      State: Active
      Physical state: LinkUp
      Rate: 200
- RDMA devices are available:
  $ ibv_devinfo | grep state
  state: PORT_ACTIVE (4)
- Network connectivity works:
  $ ping 10.0.0.2
  PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
  64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.156 ms
- Distributed training completes successfully (but slowly):
  - Training runs to completion
  - NCCL uses IB transport (not falling back to TCP)
  - Performance is degraded due to the CPU bounce buffer path (see the benchmark sketch after this list)
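The benchmark referenced above: to quantify the bounce-buffer penalty we plan to compare host-memory and GPU-memory RDMA bandwidth with perftest (a sketch assuming a CUDA-enabled perftest build is available on DGX OS):
# Host-memory baseline:
$ ib_write_bw -d mlx5_0 --report_gbits                # server, on 10.0.0.1
$ ib_write_bw -d mlx5_0 --report_gbits 10.0.0.1       # client, on 10.0.0.2
# GPU-memory path; until nvidia-peermem loads we expect this to fail
# or fall back rather than reach link rate:
$ ib_write_bw -d mlx5_0 --report_gbits --use_cuda=0 10.0.0.1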
Research Conducted
We’ve found this is a known issue affecting multiple platforms:
- Grace Hopper (GH200) systems on Rocky Linux
- Various Ubuntu versions (16.04 through 22.04)
- Both x86_64 and ARM64 architectures
- Multiple kernel versions (4.4 through 6.11)
Reference Issues:
- NVIDIA Forums: “modprobe nvidia-peermem silently fails with EINVAL” (April 2024)
- GitHub Mellanox/nv_peer_memory: Issues #28, #116
- Rocky Linux Forums: “Mismatched Version of Kernel Symbol” (July 2024)
Why This Is Surprising for DGX Spark
While this is a known issue on DIY systems, we did not expect this on DGX Spark because:
- DGX systems ship with pre-validated software stacks
- DGX OS 7.2.3 is a custom Ubuntu image built specifically for DGX Spark
- These are $16,000+ enterprise systems, not DIY builds
- The product launched October 15, 2025; it is brand new and should have been tested
- GPU Direct RDMA is a core advertised feature for multi-system clusters
The ConnectX-7 networking and dual-system clustering are prominently featured in DGX Spark marketing materials, implying GPU Direct RDMA functionality.
What We’ve Tried
Attempted Solution 1: Manual Module Loading
$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
Result: Failed with “Invalid argument”
Attempted Solution 2: Checking Module Dependencies
$ modinfo /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
[Module exists but lacks proper MLNX_OFED symbols]
Result: The module exists but was compiled without the MLNX_OFED peer memory API (see the verification sketch below)
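To verify this directly, the symbol versions the module was built against can be dumped (a diagnostic sketch; we can attach the actual output):
$ modprobe --dump-modversions \
    /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko \
    | grep -i peer_memory
# This lists each imported symbol with the CRC it was compiled against;
# comparing the ib_register_peer_memory_client CRC with the one exported
# by the MLNX_OFED ib_core (Module.symvers) would confirm a mismatch.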
Attempted Solution 3: Researching Alternative (nv_peer_mem)
Based on research, the standard solution is to use nv_peer_mem from MLNX_OFED instead of nvidia-peermem from the GPU driver. However:
$ apt-cache search nvidia-peer-memory
[No results on ARM64 DGX OS 7.2.3]
The nvidia-peer-memory-dkms package appears to not be available in the DGX Spark software repositories.
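If NVIDIA confirms it is safe on DGX OS, the community fallback would be building the module from the Mellanox source tree cited above (a sketch only; the repository is deprecated in favor of nvidia-peermem, and we would strongly prefer an official package):
$ git clone https://github.com/Mellanox/nv_peer_memory.git
$ cd nv_peer_memory
$ make                       # assumes MLNX_OFED and driver sources are present
$ sudo insmod nv_peer_mem.ko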
What We’re Requesting
Immediate Need
1. Official workaround or fix to enable GPU Direct RDMA on DGX Spark systems running DGX OS 7.2.3
Possible solutions:
- Provide an nvidia-peer-memory-dkms package for ARM64/Ubuntu 24.04
- Provide instructions for reinstalling the GPU driver after MLNX_OFED (a sketch of what we expect this involves follows this list)
- Provide an alternative peer memory module
- Issue a software update that fixes the installation order
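For the reinstall option, we imagine something along these lines, following the GPUDirect RDMA guide quoted earlier; the package name below is our guess for DGX OS 7.2.3 and is exactly what we would like confirmed:
# Hypothetical package name; the actual DGX OS driver metapackage may differ.
$ sudo apt-get install --reinstall nvidia-dkms-580-open
$ dkms status                # confirm the 580.95.05 modules rebuilt against OFED headers
$ sudo modprobe nvidia-peermem
$ ls /sys/kernel/mm/memory_peers/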
2. Validation that our diagnosis is correct and this is the issue
3. Timeline for official fix if not immediately available
Long-Term Request
4. Update DGX Spark software image to prevent this issue on future shipments by:
- Installing MLNX_OFED before the GPU driver, or
- Including a properly compiled nvidia-peermem, or
- Including nv_peer_mem from MLNX_OFED in the base image
5. Documentation update to address this issue in DGX Spark setup guides
Business Impact
Current State:
- Two DGX Spark systems are operational but severely performance-limited
- Distributed training is 3-5x slower than it should be
- We cannot realize the value of the multi-system investment
- Development and research productivity are significantly hampered
Use Case: We are building AI training infrastructure for:
- Large language model fine-tuning (70B-200B parameters)
- Distributed training across both DGX Spark systems
- MSP service development leveraging local AI compute
- Customer-facing AI applications
The lack of GPU Direct RDMA fundamentally undermines the purpose of having two interconnected DGX Spark systems.
Additional Information Available
We can provide upon request:
- Complete dmesg output
- Full nvidia-bug-report.sh output
- NCCL debug logs
- Network topology diagrams
- Test scripts and benchmarks
- Any other diagnostic information needed
Preferred Contact Method
[INSERT YOUR PREFERRED CONTACT METHOD]
- Email: [INSERT EMAIL]
- Phone: [INSERT PHONE]
- Time Zone: US Central Time (Dallas, TX)
- Availability: [INSERT AVAILABILITY]
Urgency
Priority: High
Reason: Core functionality (GPU Direct RDMA for multi-system training) is not working on newly purchased enterprise hardware. This directly impacts our ability to use the systems for their intended purpose.
Timeline Need: We need a solution within 1-2 weeks to avoid significant project delays.
Summary
Two DGX Spark systems (serial numbers [INSERT]) cannot use GPU Direct RDMA for distributed training due to an nvidia-peermem module compilation issue. This appears to be caused by DGX OS 7.2.3 installing the GPU driver before MLNX_OFED, leaving nvidia-peermem compiled without the required peer memory symbols. We need an official fix or workaround to enable this core advertised feature.
Thank you for your assistance.
Jason Brashear
Titanium Computing