Rivermax & GPUDirect

Hello,

We are adding GPUDirect support to our Rivermax-based SMPTE ST 2110 sender on x86.
Flow creation failed whenever we allocated the memory on the GPU, so we decided to test the generic_receiver demo:

sudo ./generic_receiver -i 10.10.1.10 -m 239.1.1.1 -p 2000 -s 10.10.1.10 -g 0

And it also fails with the following error:

(…)
#########################################

Rivermax library version: 12.2.10.23

Application version: 12.2.10.23

#########################################
(…)
CUDA memory allocation on GPU - cuMemCreate
RDMA is not supported or not enabled, status = 0 val = 0
Error: Fail to Allocate GPU Payload memory
(…)
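
The "status = 0 val = 0" pair looks like a CUDA driver attribute query that succeeded but returned 0. Below is a minimal sketch for reproducing that check outside Rivermax. It assumes (we have not confirmed this against the demo source) that the demo reads CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED through cuDeviceGetAttribute. The numeric value 116 is what we recall from cuda.h in recent toolkits and should be verified locally. query_gpudirect_rdma and load_driver are hypothetical helper names.

```python
import ctypes

CUDA_SUCCESS = 0
# Assumption: enum value as found in cuda.h of recent CUDA toolkits.
# Verify it against the cuda.h shipped with your own toolkit.
CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED = 116


def load_driver():
    """Load the CUDA driver library (assumed to be installed as libcuda.so.1)."""
    return ctypes.CDLL("libcuda.so.1")


def query_gpudirect_rdma(lib, ordinal=0):
    """Return (status, val) for the GPUDirect RDMA attribute of GPU `ordinal`.

    `lib` is any object exposing cuInit, cuDeviceGet and cuDeviceGetAttribute
    with the CUDA driver API calling convention.
    """
    status = lib.cuInit(0)
    if status != CUDA_SUCCESS:
        return status, 0

    device = ctypes.c_int(0)
    status = lib.cuDeviceGet(ctypes.byref(device), ordinal)
    if status != CUDA_SUCCESS:
        return status, 0

    val = ctypes.c_int(0)
    status = lib.cuDeviceGetAttribute(
        ctypes.byref(val), CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, device
    )
    return status, val.value
```

On the machine itself, query_gpudirect_rdma(load_driver()) returning (0, 0) would mean the driver reports no GPUDirect RDMA support for GPU 0, independent of Rivermax.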

It seems like we need to enable RDMA, but we cannot find documentation that clearly explains how to do it.
According to the fifth warning at https://docs.nvidia.com/networking/display/GPUDirectRDMAv18/Installation the module should already be installed:

"GPUDirect RDMA kernel mode support is now provided in the form of a fully open source nvidia-peermem kernel module, that is installed as part of the NVIDIA driver. The nvidia_peermem module is a drop-in replacement for nv_peer_mem.

This simplifies the installation workflow for our customers, so that there is no longer a need to retrieve and build code from a separate site. Now, simply installing the driver will suffice.

Please refer to nvidia_peermem documentation for more information."

We are using driver version 515.65.01 with CUDA 11.7, and the module is indeed loaded:

bisect@dolores /o/m/r/1/apps> lsmod | grep nvidia_peermem
nvidia_peermem 16384 0
nvidia 40816640 1068 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core 397312 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
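
The lsmod check above can be scripted, for example to repeat it on several hosts. Here is a minimal sketch, with the sample text copied from the output above; parse_lsmod is a hypothetical helper. Seeing nvidia_peermem in the user lists of both nvidia and ib_core only shows that the module is loaded and linked against both, not that peer-memory registration actually works.

```python
def parse_lsmod(text):
    """Map module name -> (size, use_count, [users]) from `lsmod` output."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = (int(fields[1]), int(fields[2]), users)
    return modules


SAMPLE = """\
nvidia_peermem 16384 0
nvidia 40816640 1068 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core 397312 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
"""

mods = parse_lsmod(SAMPLE)
peermem_loaded = "nvidia_peermem" in mods
# nvidia_peermem should appear as a user of both the GPU driver and ib_core.
peermem_bound = all("nvidia_peermem" in mods[dep][2] for dep in ("nvidia", "ib_core"))
print(peermem_loaded, peermem_bound)
```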

We also tried installing nvidia-peer-memory_1.1, but the installation fails:

DKMS: install completed.
Building initial module for 5.15.0-48-generic
Secure Boot not enabled on this system.
Done.

nv_peer_mem.ko:
Running module version sanity check.

  • Original module
    • No original module exists within this kernel
  • Installation
    • Installing to /lib/modules/5.15.0-48-generic/updates/dkms/

depmod...

DKMS: install completed.
modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument
dpkg: error processing package nvidia-peer-memory-dkms (--install):
installed nvidia-peer-memory-dkms package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
nvidia-peer-memory-dkms

=====
dmesg:
[1646355.231123] nv_peer_mem: module uses symbols from proprietary module nvidia, inheriting taint.
[1646355.231188] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[1646355.231191] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
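
The "err -22" in dmesg is EINVAL, the same "Invalid argument" that modprobe reports. A "disagrees about version of symbol" line generally means the module was built against different symbol versions than the ones exported by the running kernel modules (here ib_core). Here is a minimal sketch that extracts those entries from dmesg text; symbol_errors is a hypothetical helper, and the sample lines are copied from the log above.

```python
import errno
import re

MISMATCH_RE = re.compile(r"(\S+): disagrees about version of symbol (\S+)")
UNKNOWN_RE = re.compile(r"(\S+): Unknown symbol (\S+) \(err (-?\d+)\)")


def symbol_errors(dmesg_text):
    """Return (version_mismatches, unknown_symbols) found in dmesg text."""
    mismatches, unknown = [], []
    for line in dmesg_text.splitlines():
        m = MISMATCH_RE.search(line)
        if m:
            mismatches.append((m.group(1), m.group(2)))
            continue
        m = UNKNOWN_RE.search(line)
        if m:
            code = abs(int(m.group(3)))
            # Translate the numeric errno into its symbolic name when known.
            unknown.append((m.group(1), m.group(2), errno.errorcode.get(code, str(code))))
    return mismatches, unknown


DMESG = """\
[1646355.231123] nv_peer_mem: module uses symbols from proprietary module nvidia, inheriting taint.
[1646355.231188] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[1646355.231191] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
"""

mismatches, unknown = symbol_errors(DMESG)
print(mismatches)
print(unknown)
```

Any symbol listed under mismatches points at a build-environment problem (headers or Module.symvers not matching the installed ib_core) rather than at a missing module.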

System info:

bisect@dolores /o/m/r/1/apps> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal

bisect@dolores /o/n/nvidia-peer-memory-1.1> apt search ofed
Sorting... Done
Full Text Search... Done
hping3/focal 3.a2.ds2-9 amd64
Active Network Smashing Tool

mlnx-ofed-kernel-dkms/now 5.5-OFED.5.5.1.0.3.1 all [installed,local]
DKMS support for mlnx-ofed kernel modules

mlnx-ofed-kernel-utils/now 5.5-OFED.5.5.1.0.3.1 amd64 [installed,local]
Userspace tools to restart and tune mlnx-ofed kernel modules

mlnx-tools/now 5.2.0-0.55103 amd64 [installed,local]
Userspace tools to restart and tune MLNX_OFED kernel modules

ofed-scripts/now 5.5-OFED.5.5.1.0.3 amd64 [installed,local]
MLNX_OFED utilities

bisect@dolores /o/m/r/1/apps> lspci | egrep 'Mell|NV'
25:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
25:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
26:00.0 VGA compatible controller: NVIDIA Corporation Device 2507 (rev a1)
26:00.1 Audio device: NVIDIA Corporation Device 228e (rev a1)

#########################################

Rivermax library version: 12.2.10.23

Application version: 12.2.10.23

#########################################

VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.4.0-1 Release built on Oct 5 2021 11:18:30
VMA INFO: Cmd Line: ./generic_receiver -i 10.10.1.10 -m 239.1.1.1 -p 2000 -s 10.10.1.10 -g 0
VMA INFO: OFED Version: mlnx-en-5.5-1.0.3.2:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Tx Mem Segs TCP 4 [VMA_TX_SEGS_TCP]
VMA INFO: Tx Mem Bufs 256 [VMA_TX_BUFS]
VMA INFO: Tx QP WRE 128 [VMA_TX_WRE]
VMA INFO: Tx Prefetch Bytes 32 [VMA_TX_PREFETCH_BYTES]
VMA INFO: Rx Mem Bufs 256 [VMA_RX_BUFS]
VMA INFO: Rx QP WRE 128 [VMA_RX_WRE]
VMA INFO: Rx Prefetch Bytes 32 [VMA_RX_PREFETCH_BYTES]
VMA INFO: Force Flowtag for MC Enabled [VMA_MC_FORCE_FLOWTAG]
VMA INFO: CQ AIM Max Count 64 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: ---------------------------------------------------------------------------

bisect@dolores /o/m/r/1/apps> nvidia-smi
Wed Sep 21 11:52:53 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce … On | 00000000:26:00.0 On | N/A |
| 0% 60C P2 38W / 130W | 346MiB / 8192MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

Any help is much appreciated.