I am porting a device driver from Intel/AMD x86_64 to Jetson AGX Xavier. The driver itself already works on the Jetson using "normal" DMA (from my card to system RAM), but I am having trouble getting RDMA to work on the Jetson (it works on x86_64).
I modified my kernel driver source according to the "GPUDirect RDMA" page of the CUDA Toolkit Documentation, and I am able to compile the driver.
I have some questions/problems:
The above document wants me to link my kernel module to nvidia.ko, which works on x86. But I wasn't able to find nvidia.ko on the Jetson. Instead I found nvgpu.ko, and I was able to link to that. Is that the correct way on the Jetson?
I then ran into problems when running a small userspace program that works fine on x86_64 using RDMA:
When my kernel driver module calls nvidia_p2p_get_pages(), it locks up. The function does not return, and after a while the Jetson automatically reboots.
I do not see any messages in the output of dmesg.
My userspace program allocates the buffer for RDMA using cudaHostAlloc(), as explained in the above document (section 4.4.1). It passes the pointer to a library, which passes the pointer untouched to the kernel driver.
I tried different buffer sizes (64 kB and 8 MB), with the same result.
I also tried to use the pointer returned from cudaHostGetDevicePointer(), but it was identical to the original one.
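To rule out alignment of the buffer as a cause, the range handed to nvidia_p2p_get_pages() can be rounded to page boundaries first. The following is only a sketch of that rounding, not code from my driver: the 4 KiB page size is an assumption for Jetson (the desktop documentation pins in 64 KiB units), and struct pin_range and pin_range_for() are hypothetical names that do not exist in any NVIDIA header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pinning granularity on Jetson.
 * The desktop GPUDirect RDMA documentation uses 64 KiB (shift 16). */
#define GPU_PAGE_SHIFT 12
#define GPU_PAGE_SIZE  ((uint64_t)1 << GPU_PAGE_SHIFT)
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

/* Hypothetical helper type: the page-aligned range covering a buffer. */
struct pin_range {
    uint64_t start;   /* buffer address rounded down to a page boundary */
    uint64_t length;  /* bytes from start to the rounded-up buffer end  */
    uint64_t pages;   /* number of pages in that range                  */
};

/* Round [vaddr, vaddr + size) outward to whole pages. */
static struct pin_range pin_range_for(uint64_t vaddr, uint64_t size)
{
    struct pin_range r;
    uint64_t end;

    assert(size > 0);

    end = (vaddr + size + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
    r.start  = vaddr & GPU_PAGE_MASK;
    r.length = end - r.start;
    r.pages  = r.length >> GPU_PAGE_SHIFT;
    return r;
}
```

With these assumptions, an aligned 64 kB buffer covers 16 pages, and an 8 MB buffer that starts 16 bytes past a page boundary covers 2049 pages, because the rounding adds one extra page.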
Any idea what is going wrong here, and how I can fix it?