RDMA on Jetson AGX Xavier locks up in nvidia_p2p_get_pages()

Ginsengelf · January 26, 2021, 12:56pm

Hi everyone,

I am porting a device driver from Intel/AMD x86_64 to Jetson AGX Xavier. The driver already works on the Jetson using "normal" DMA (from my card to system RAM), but I am having trouble getting RDMA to work on the Jetson (it works on x86_64).

I modified my kernel driver source according to the "GPUDirect RDMA" page of the CUDA Toolkit Documentation, and I am able to compile the driver.

I have some questions/problems:

1. The above document wants me to link my kernel module against nvidia.ko, which works on x86. But I wasn't able to find nvidia.ko on the Jetson. Instead I found nvgpu.ko, and I was able to link against that. Is that the correct way on the Jetson?
2. I then run into problems when I run a small userspace program that works fine on x86_64 using RDMA:
   - When my kernel driver module calls nvidia_p2p_get_pages(), it locks up. The function does not return, and after a while the Jetson automatically reboots.
   - I do not see any messages in the output of dmesg.
   - My userspace program allocates the buffer for RDMA using cudaHostAlloc(), as explained in the above document (section 4.4.1). It passes the pointer to a library, which passes the pointer untouched to the kernel driver.
   - I tried different buffer sizes (64 kB and 8 MB), with the same result.
   - I also tried to use the pointer returned from cudaHostGetDevicePointer(), but that was the same as the original one.

Any idea what goes wrong here? And how can I fix it?

Best regards,

Ginsengelf

kayccc · January 26, 2021, 11:08pm

Hi Ginsengelf,

Please refer to "GPUDirect RDMA on NVIDIA Jetson AGX Xavier".

Ginsengelf · January 29, 2021, 12:47pm

Thanks, but that page mainly contains what is already written in the link in my first post (or at least I didn't find anything new).
I had a look at the picoevb RDMA example, and it seems to do the same as my code. I don't have that hardware, so I cannot test it.
Any other ideas?

Thanks,
Ginsengelf

Ginsengelf · February 2, 2021, 7:30am

Hi again,

I found my problem: my driver was holding current->mm->mmap_sem when it called nvidia_p2p_get_pages(), and apparently that caused a deadlock. After removing the down_read() call for that semaphore, nvidia_p2p_get_pages() no longer locks up.

Ginsengelf
