Mellanox OFED GPUDirect RDMA for AGX Xavier

Colin · April 21, 2021, 2:23am

I’m not sure whether the AGX kernel 4.9.140-tegra supports OFED GPUDirect RDMA, but I followed the Mellanox OFED GPUDirect RDMA guide anyway.

I was able to install the MLNX_OFED package successfully, as required by the user manual. I then downloaded the GPUDirect RDMA package nvidia-peer-memory_1.1.tar.gz, but when I tried to build nv_peer_mem I got the following error:

DKMS make.log for nvidia-peer-memory-1.1 for kernel 4.9.140-tegra (aarch64)
Tue Apr 20 22:12:02 EDT 2021
INFO: Building with MLNX_OFED from: /usr/src/ofa_kernel/default
/var/lib/dkms/nvidia-peer-memory/1.1/build/create_nv.symvers.sh 4.9.140-tegra
-E- Cannot locate nvidia modules!
CUDA driver must be installed before installing this package!
Makefile:91: recipe for target ‘gen_nv_symvers’ failed
make: *** [gen_nv_symvers] Error 1

I have verified that the CUDA driver and devel packages are all installed. I’m not sure what I am missing.

kayccc · April 21, 2021, 3:20am

That’s not supported on Jetson AGX Xavier; it is supported only on NVIDIA® Tesla™ / Quadro™ K-Series or Tesla™ / Quadro™ P-Series GPUs.

ahepner · December 7, 2021, 4:19pm

I’ve been working on a port for the Jetson AGX that somewhat works.
It still crashes because of some SMMU errors.
It’s still pre-alpha, and a lot of things are still hardcoded, but at least it compiles.
https://github.com/ah-iai/nv_peer_memory/tree/1_1_0_release_Jetson
Please report any issues so we can take it out of pre-alpha and upstream it for the whole community to enjoy.

sunlingyu · July 11, 2023, 9:43am

Hi, I think this might be helpful for anyone who comes here and meets the same issue:

I have managed to customize nv_peer_memory for Jetson Orin. The compiled nv_peer_memory kernel module has been shown to work by running the ib_write_bw test tool with GPUDirect RDMA enabled (with the use_cuda flag set). The git repo of the modified version is here. Please feel free to use it.

One thing to mention is that the source code of the ib_write_bw tool (which is part of the Mellanox perftest package) needs to be slightly modified to support Jetson. The cuMemAlloc function call must be replaced by a cuMemAllocHost + cuMemHostGetDevicePointer call pair, according to the official guide for porting GPUDirect RDMA code to Jetson.
