RDMA on Jetson AGX Xavier locks up in nvidia_p2p_get_pages()

Hi everyone,

I am porting a device driver from Intel/AMD x86_64 to Jetson AGX Xavier. The driver itself already works on the Jetson using "normal" DMA (from my card to PC RAM), but I am having trouble getting RDMA to work on the Jetson (it works on x86_64).

I modified my kernel driver source according to the "GPUDirect RDMA" page of the CUDA Toolkit Documentation, and I am able to compile the driver.

I have some questions/problems:

  1. The above document wants me to link my kernel module to nvidia.ko, which works on x86. But I wasn't able to find nvidia.ko on the Jetson. Instead I found nvgpu.ko, and I was able to link to that. Is that the correct way on the Jetson?

  2. I then run into problems when I run a small userspace program that works fine on x86_64 using RDMA:
    When my kernel driver module calls nvidia_p2p_get_pages(), it locks up: the function does not return, and after a while the Jetson automatically reboots.
    I do not see any messages in the output of dmesg.
    My userspace program allocates the buffer for RDMA using cudaHostAlloc(), as explained in the above document (section 4.4.1). It passes the pointer to a library, which passes the pointer untouched to the kernel driver.
    I tried different buffer sizes (64kB and 8MB), with the same result.
    I also tried to use the pointer returned from cudaHostGetDevicePointer(), but it was identical to the original pointer.
    Any idea what is going wrong here, and how I can fix it?

Best regards,

Ginsengelf

Hi Ginsengelf,

Please refer to GPUDirect RDMA on NVIDIA Jetson AGX Xavier

Thanks, but that page mainly covers what is already written in the link in my first post (or at least I didn't find anything new).
I had a look at the picoevb RDMA example, and it seems to do the same as my code. I don’t have that hardware, so I cannot test it.
Any other ideas?

Thanks,
Ginsengelf

Hi again,

I found my problem: my driver had current->mm->mmap_sem locked (via down_read()) when it called nvidia_p2p_get_pages(), and apparently that caused a deadlock. After removing the down_read() call for that semaphore, nvidia_p2p_get_pages() no longer locks up.

Ginsengelf