GPU Direct RDMA Not Working on DGX Spark Systems - nvidia-peermem Module Fails to Load

System Information

Product: NVIDIA DGX Spark (2 units)

Software Configuration:

  • OS: Ubuntu 24.04 (DGX OS 7.2.3)

  • Kernel: 6.11.0-1016-nvidia (aarch64)

  • GPU Driver: NVIDIA 580.95.05 (nvidia-580-open)

  • CUDA Toolkit: 13.0.2

  • NCCL: 2.27.7

  • MLNX_OFED: 24.07-0.6.1.0

  • Architecture: ARM64 (aarch64)

Hardware Configuration:

  • 2× DGX Spark systems with GB10 Grace Blackwell Superchips

  • Connected via 200Gbps InfiniBand (Mellanox ConnectX-7)

  • Network: 10.0.0.1 and 10.0.0.2

  • Static IP configuration on enp1s0f0np0 interface


Problem Description

Issue: GPU Direct RDMA is not functional between our two DGX Spark systems, forcing distributed training to use a slow CPU bounce buffer path instead of direct GPU-to-GPU transfers over InfiniBand.

Impact: Distributed training performance is degraded by 3-5x due to:

  • Data path: GPU → CPU RAM → InfiniBand → CPU RAM → GPU (instead of direct GPU → IB → GPU)

  • Bandwidth limited to roughly 25-30 Gb/s instead of the full ~200 Gbps link capacity

  • High CPU overhead during training

  • Underutilized $32,000+ hardware investment

Current Status:

  • ✅ MLNX_OFED 24.07 installed and working correctly

  • ✅ InfiniBand link active (200Gbps)

  • ✅ RDMA communication working

  • ✅ NCCL using NET/IB transport

  • ✅ Distributed training completes (but slowly via CPU path)

  • ❌ GPU Direct RDMA explicitly disabled by NCCL

  • ❌ nvidia-peermem kernel module fails to load


Technical Details

Root Cause Identified

The nvidia-peermem kernel module, located at:

/lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko

fails to load with an “Invalid argument” error:

$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

System logs show:

$ sudo dmesg | tail
[timestamp] nvidia_peermem: Unknown symbol ib_register_peer_memory_client
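
For reference, a diagnostic sketch (assuming the module path shown above) that compares the peer-memory symbols the running kernel exports against the symbol versions the packaged module expects:

$ sudo grep -i peer_memory /proc/kallsyms    # what the running kernel provides
$ modprobe --dump-modversions \
    /lib/modules/$(uname -r)/kernel/nvidia-580-open/nvidia-peermem.ko | grep -i peer

If ib_register_peer_memory_client is absent from the first listing, or its recorded CRC differs from what the module expects, the module cannot be inserted.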

Analysis

Based on NVIDIA’s own documentation and community reports, this occurs when:

  1. The NVIDIA GPU driver is installed before MLNX_OFED

  2. The nvidia-peermem module is compiled without the MLNX_OFED peer memory symbols

  3. Later installation of MLNX_OFED does not trigger recompilation of nvidia-peermem

  4. The module lacks the required ib_register_peer_memory_client and ib_unregister_peer_memory_client symbols

From NVIDIA’s documentation (GPUDirect RDMA guide):

“If the NVIDIA GPU driver is installed before MOFED, the GPU driver must be uninstalled and installed again to make sure nvidia-peermem is compiled with the RDMA APIs that are provided by MLNX_OFED.”
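
A related check (a sketch; exact output varies by install) is to confirm which ib_core stack is in use and whether DKMS rebuilt anything after MLNX_OFED was installed:

$ modinfo ib_core | grep -E 'filename|version'
$ dkms status

If ib_core resolves to the MLNX_OFED tree while the nvidia-peermem module predates the OFED install (or is a prebuilt module that DKMS never rebuilds), that matches the installation-order scenario described above.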

NCCL Diagnostic Output

$ NCCL_DEBUG=INFO [training script]
...
NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB enp1s0f0np0:10.0.0.1<0>
NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' (CUPTI metric failed)
...

The “GPU Direct RDMA Disabled” message confirms that NCCL cannot use GPUDirect RDMA and is falling back to the CPU bounce-buffer path.
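
Once a fix is available, we plan to re-run the same diagnostic alongside a bandwidth benchmark; a sketch using nccl-tests and Open MPI (the nccl-tests build path and MPI layout are our assumptions, not something shipped with DGX OS):

$ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
    mpirun -np 2 -H 10.0.0.1:1,10.0.0.2:1 \
    -x NCCL_DEBUG -x NCCL_DEBUG_SUBSYS -x NCCL_IB_HCA=mlx5_0 \
    ./nccl-tests/build/all_reduce_perf -b 8M -e 1G -f 2

With GPU Direct RDMA working, the NET/IB log line should report it as enabled for mlx5_0 rather than disabled, and bus bandwidth should approach the link rate at large message sizes.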

Missing System Components

$ ls /sys/kernel/mm/memory_peers/
ls: cannot access '/sys/kernel/mm/memory_peers/': No such file or directory

This directory should exist when peer memory support is active.
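
A quick re-check we can run after any proposed fix (a sketch; the exact entry name depends on which peer-memory client registers):

$ ls /sys/kernel/mm/memory_peers/ 2>/dev/null \
    && cat /sys/kernel/mm/memory_peers/*/version 2>/dev/null \
    || echo "no peer-memory clients registered"

Today both systems take the fallback branch, consistent with the missing directory above.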


What We’ve Investigated

Verification That Other Components Work

  1. MLNX_OFED is correctly installed:

    $ ofed_info -s
    MLNX_OFED_LINUX-24.07-0.6.1.0
    
    
  2. InfiniBand link is active:

    $ ibstat
    CA 'mlx5_0'
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 200
    
    
  3. RDMA devices are available:

    $ ibv_devinfo | grep state
            state:                  PORT_ACTIVE (4)
    
    
  4. Network connectivity works:

    $ ping 10.0.0.2
    PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
    64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.156 ms
    
    
  5. Distributed training completes successfully (but slowly):

    • Training runs to completion

    • NCCL uses IB transport (not falling back to TCP)

    • Performance is degraded due to CPU bounce buffer path
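
In addition to the checks above, we can run a direct RDMA bandwidth test with perftest on request; a sketch follows (the --use_cuda variant assumes perftest was built with CUDA support, which we have not verified on DGX OS 7.2.3):

$ ib_write_bw -d mlx5_0 --report_gbits                        # host memory, on 10.0.0.2 (server)
$ ib_write_bw -d mlx5_0 --report_gbits 10.0.0.2               # host memory, on 10.0.0.1 (client)
$ ib_write_bw -d mlx5_0 --use_cuda=0 --report_gbits           # GPU memory, server
$ ib_write_bw -d mlx5_0 --use_cuda=0 --report_gbits 10.0.0.2  # GPU memory, client

The host-memory run should confirm the raw 200 Gbps link; the GPU-memory run exercises the GPUDirect path and is expected to fail or fall back while nvidia-peermem cannot load.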

Research Conducted

We’ve found this is a known issue affecting multiple platforms:

  • Grace Hopper (GH200) systems on Rocky Linux

  • Various Ubuntu versions (16.04 through 22.04)

  • Both x86_64 and ARM64 architectures

  • Multiple kernel versions (4.4 through 6.11)

Reference Issues:

  • NVIDIA Forums: “modprobe nvidia-peermem silently fails with EINVAL” (April 2024)

  • GitHub Mellanox/nv_peer_memory: Issues #28, #116

  • Rocky Linux Forums: “Mismatched Version of Kernel Symbol” (July 2024)


Why This Is Surprising for DGX Spark

While this is a known issue on DIY systems, we did not expect this on DGX Spark because:

  1. DGX systems ship with pre-validated software stacks

  2. DGX OS 7.2.3 is a custom Ubuntu image built specifically for DGX Spark

  3. These are $16,000+ (per unit) enterprise systems, not DIY builds

  4. The product launched on October 15, 2025, so the software image is brand new and should have been validated

  5. GPU Direct RDMA is a core advertised feature for multi-system clusters

The ConnectX-7 networking and dual-system clustering are prominently featured in DGX Spark marketing materials, implying GPU Direct RDMA functionality.


What We’ve Tried

Attempted Solution 1: Manual Module Loading

$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

Result: Failed with “Invalid argument”

Attempted Solution 2: Checking Module Dependencies

$ modinfo /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
[Module exists but lacks proper MLNX_OFED symbols]

Result: The module exists but appears to have been compiled without the MLNX_OFED peer memory API
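
For completeness, the specific modinfo fields worth inspecting here (a sketch; full output available on request):

$ modinfo /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko \
    | grep -E 'depends|vermagic|srcversion'

Together with the --dump-modversions output earlier, these fields indicate which ib_core and peer-memory API the module was built against.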

Attempted Solution 3: Researching Alternative (nv_peer_mem)

Based on community reports, a commonly suggested alternative is to use the nv_peer_mem module (packaged as nvidia-peer-memory-dkms) from MLNX_OFED instead of nvidia-peermem from the GPU driver. However:

$ apt-cache search nvidia-peer-memory
[No results on ARM64 DGX OS 7.2.3]

The nvidia-peer-memory-dkms package does not appear to be available in the DGX Spark software repositories.
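
We have held off on the documented driver-reinstall workaround pending guidance, since we do not want to break the factory DGX OS image. For reference, our understanding of what it would involve, as a sketch only (the package name is an assumption based on the installed nvidia-580-open driver):

$ # Not yet executed - pending confirmation from support
$ sudo apt-get install --reinstall nvidia-dkms-580-open   # or the DGX OS equivalent driver package
$ dkms status | grep -i nvidia                            # confirm the modules rebuilt after MLNX_OFED
$ sudo update-initramfs -u && sudo reboot
$ sudo modprobe nvidia-peermem && ls /sys/kernel/mm/memory_peers/   # after reboot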


What We’re Requesting

Immediate Need

1. Official workaround or fix to enable GPU Direct RDMA on DGX Spark systems running DGX OS 7.2.3

Possible solutions:

  • Provide nvidia-peer-memory-dkms package for ARM64/Ubuntu 24.04

  • Provide instructions for reinstalling GPU driver after MLNX_OFED

  • Provide alternative peer memory module

  • Issue software update that fixes the installation order

2. Validation that our diagnosis is correct and that this is indeed the root cause

3. Timeline for official fix if not immediately available

Long-Term Request

4. Update DGX Spark software image to prevent this issue on future shipments by:

  • Installing MLNX_OFED before GPU driver, or

  • Including properly compiled nvidia-peermem, or

  • Including nv_peer_mem from MLNX_OFED in the base image

5. Documentation update to address this issue in DGX Spark setup guides


Business Impact

Current State:

  • Two DGX Spark systems are operational but severely performance-limited

  • Distributed training is 3-5x slower than it should be

  • Cannot realize the value of the multi-system investment

  • Development and research productivity is significantly hampered

Use Case: We are building AI training infrastructure for:

  • Large language model fine-tuning (70B-200B parameters)

  • Distributed training across both DGX Spark systems

  • MSP service development leveraging local AI compute

  • Customer-facing AI applications

The lack of GPU Direct RDMA fundamentally undermines the purpose of having two interconnected DGX Spark systems.


Additional Information Available

We can provide upon request:

  • Complete dmesg output

  • Full nvidia-bug-report.sh output

  • NCCL debug logs

  • Network topology diagrams

  • Test scripts and benchmarks

  • Any other diagnostic information needed


Preferred Contact Method

[INSERT YOUR PREFERRED CONTACT METHOD]

  • Email: [INSERT EMAIL]

  • Phone: [INSERT PHONE]

  • Time Zone: US Central Time (Dallas, TX)

  • Availability: [INSERT AVAILABILITY]


Urgency

Priority: High

Reason: Core functionality (GPU Direct RDMA for multi-system training) is not working on newly purchased enterprise hardware. This directly impacts our ability to use the systems for their intended purpose.

Timeline Need: We need a solution within 1-2 weeks to avoid significant project delays.


Summary

Two DGX Spark systems (serial numbers [INSERT]) cannot use GPU Direct RDMA for distributed training due to an nvidia-peermem module compilation issue. This appears to be caused by DGX OS 7.2.3 installing the GPU driver before MLNX_OFED, leaving nvidia-peermem compiled without the required peer memory symbols. We need an official fix or workaround to enable this core advertised feature.

Thank you for your assistance.

Jason Brashear
Titanium Computing

For DGX Spark RDMA details, please refer to the DGX Spark / GB10 FAQ.