I’ve been facing this issue for a long time, so let’s go! I’m developing a Linux device driver that uses GPUDirect RDMA with CUDA 12.2. According to the documentation, nv-peer-mem was deprecated because of a race condition, and its replacement is nvidia-peermem.
I settled for nv-peer-mem, but I think I have hit that bug… I’m seeing inconsistent behavior that sometimes hangs my kernel in nvidia_p2p_put_pages (nv-peer-mem).
So, please, how can I use nvidia-peermem?
Environment: JetPack 36.3 - Tegra 5.15.136 - CUDA 12.2
The section “Changes in CUDA 11.4” says that the module should be loaded manually. Do I still need to load it manually with CUDA 12.2?
All the documentation I found includes this note:
“If the NVIDIA GPU driver is installed before MOFED, the GPU driver must be uninstalled and installed again to make sure nvidia-peermem is compiled with the RDMA APIs that are provided by MOFED.”
If I just try to modprobe nvidia-peermem, it returns a fatal error: Module nvidia-peermem not found… It isn’t clear whether I need to install something or whether the module ships with the system.
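To narrow down the “not found” error, here is a minimal sketch (Python, standard library only) of the check I have in mind: is the module already loaded, and does any module file with that name exist under /lib/modules for the running kernel? The helper names (normalize, is_loaded, find_module_files) are my own, and the two module names are just the ones mentioned in this post. I am assuming the usual Linux layout (/proc/modules and /lib/modules/<kernel release>); this only shows whether a file is present, not why it is missing.

```python
import os
import platform
from pathlib import Path


def normalize(name: str) -> str:
    """The kernel treats '-' and '_' in module names as equivalent."""
    return name.replace("-", "_")


def is_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed in the given /proc/modules text."""
    wanted = normalize(name)
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and normalize(fields[0]) == wanted:
            return True
    return False


def find_module_files(name: str, modules_root: str) -> list:
    """Return sorted paths of .ko files (possibly compressed) matching `name`."""
    wanted = normalize(name)
    hits = []
    for dirpath, _dirs, files in os.walk(modules_root):
        for fname in files:
            if ".ko" not in fname:
                continue
            # "nvidia-peermem.ko.xz" -> "nvidia-peermem"
            stem = fname.split(".ko")[0]
            if normalize(stem) == wanted:
                hits.append(os.path.join(dirpath, fname))
    return sorted(hits)


if __name__ == "__main__":
    # Assumption: modules live under /lib/modules/<running kernel release>.
    root = f"/lib/modules/{platform.release()}"
    try:
        loaded_text = Path("/proc/modules").read_text()
    except OSError:
        loaded_text = ""  # not Linux, or /proc is not available
    # Module names taken from this post; adjust as needed.
    for mod in ("nvidia-peermem", "nv_peer_mem"):
        print(mod,
              "| loaded:", is_loaded(mod, loaded_text),
              "| files:", find_module_files(mod, root))
```

If the file list comes back empty for nvidia-peermem, then the module was never built or installed for this kernel, which would be consistent with the modprobe error above.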
If it is necessary to uninstall and reinstall the GPU drivers, should that be done with sdkmanager?
Is there an example of using nvidia-peermem? Is there any documentation for the module? I can’t find any information about it; I don’t even know the parameters of its functions.
Hi! Thanks a lot @AastaLLL.
Yes, the config required in r36 is meant for the deprecated nv_peer_mem module. I’m trying to use the new module, nvidia-peermem. nv-peer-mem was running, but I suspect it is causing incorrect and inconsistent behavior.
Actually, I can’t test anything. In theory, I would insert the module and switch to the persistent API. However, I can’t insert nvidia-peermem because it isn’t installed, and I can’t install it because the guide isn’t very clear. How do I install it on the Jetson?