Mellanox OFED GPUDirect RDMA for AGX Xavier

I’m not sure whether the AGX Xavier kernel (4.9.140-tegra) supports OFED GPUDirect RDMA, but I followed the Mellanox OFED GPUDirect RDMA guide anyway.

I was able to install the MLNX_OFED package as required by the user manual. I then downloaded the GPUDirect RDMA package nvidia-peer-memory_1.1.tar.gz, but when I tried to build nv_peer_mem I got the following error:

DKMS make.log for nvidia-peer-memory-1.1 for kernel 4.9.140-tegra (aarch64)
Tue Apr 20 22:12:02 EDT 2021
INFO: Building with MLNX_OFED from: /usr/src/ofa_kernel/default
/var/lib/dkms/nvidia-peer-memory/1.1/build/create_nv.symvers.sh 4.9.140-tegra
-E- Cannot locate nvidia modules!
CUDA driver must be installed before installing this package!
Makefile:91: recipe for target ‘gen_nv_symvers’ failed
make: *** [gen_nv_symvers] Error 1

I have verified that the CUDA driver and devel packages are all installed. I’m not sure what I’m missing.

That’s not supported on Jetson AGX Xavier. It is only supported on NVIDIA® Tesla™ / Quadro™ K-Series or Tesla™ / Quadro™ P-Series GPUs.

I’ve been working on a port for the Jetson AGX that somewhat works.
It still crashes because of some SMMU errors.
It’s still pre-alpha and a lot of things are still hardcoded, but at least it compiles.
https://github.com/ah-iai/nv_peer_memory/tree/1_1_0_release_Jetson
Please report any issues so we can take it out of pre-alpha and upstream it for the whole community to enjoy.

Hi, I think this might be helpful for anyone who comes here with the same issue:

I have managed to customize nv_peer_memory for Jetson Orin. The compiled nv_peer_memory kernel module has been shown to work by running the ib_write_bw test tool with GPUDirect RDMA enabled (the use_cuda flag set). The git repo of the modified version is here. Please feel free to use it.

One thing to mention is that the source code of the ib_write_bw tool (part of the Mellanox perftest suite) needs to be slightly modified to support Jetson. The cuMemAlloc call must be replaced by a cuMemAllocHost call followed by cuMemHostGetDevicePointer, according to the official guide for porting GPUDirect RDMA code to Jetson.