Kernel is crashing for a GPUDirect application

Hi!

First, my driver exposes two operations through IOCTL:

  • Driver pin_memory = nvidia_p2p_get_pages → nvidia_p2p_dma_map_pages
  • Driver unpin_memory = nvidia_p2p_dma_unmap_pages → nvidia_p2p_put_pages → callback (nvidia_p2p_free_page_table) + some "kfree"s.
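
One thing worth ruling out in a teardown like the one above is the page table being released twice, once from the unpin path and once from the free callback. The following is only a user-space C model of that idea, not driver code: `pin_ctx`, `model_pin`, `model_unpin`, `model_free_callback` and `release_table_once` are hypothetical names, and the boolean flags merely stand in for the handles returned by the real nvidia_p2p_* calls. It assumes each pinned buffer has its own context and that both paths can reach the same teardown.

```c
#include <stdbool.h>

/* Hypothetical per-buffer state; a real driver would store the page table
 * and DMA mapping handles here, one context per pinned buffer. */
struct pin_ctx {
    bool pages_held;   /* stands in for a successful nvidia_p2p_get_pages */
    bool dma_mapped;   /* stands in for a successful nvidia_p2p_dma_map_pages */
    int  table_frees;  /* counts how often the page table was released */
};

void pin_ctx_init(struct pin_ctx *c)
{
    c->pages_held = false;
    c->dma_mapped = false;
    c->table_frees = 0;
}

/* Models pin_memory: get pages, then map them. Fails if already pinned. */
int model_pin(struct pin_ctx *c)
{
    if (c->pages_held)
        return -1;
    c->pages_held = true;
    c->dma_mapped = true;
    return 0;
}

/* Releases the page table at most once, whichever path arrives first. */
void release_table_once(struct pin_ctx *c)
{
    if (!c->pages_held)
        return;
    c->pages_held = false;
    c->table_frees++;
}

/* Models the free callback: drop the mapping, then release the table. */
void model_free_callback(struct pin_ctx *c)
{
    c->dma_mapped = false;
    release_table_once(c);
}

/* Models unpin_memory: unmap first, then release. Fails if nothing is pinned. */
int model_unpin(struct pin_ctx *c)
{
    if (!c->pages_held)
        return -1;
    c->dma_mapped = false;
    release_table_once(c);
    return 0;
}
```

In this model, calling `model_unpin` and then `model_free_callback` on the same context releases the table only once. Whether the real driver needs such a guard depends on when the callback actually fires, which this sketch does not claim to reproduce.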

And the application follows the sequence:

  1. open fd
  2. cudaHostAlloc
  3. cuPointerSetAttribute
  4. pin_memory from IOCTL
  5. unpin_memory from IOCTL
  6. cudaFreeHost
  7. close fd
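
The sequence above, extended to several buffers, can be sketched as a small state check. This is a user-space C model only: it does not call CUDA or the driver, and `buf_state`, `step_alloc`, `step_pin`, `step_unpin`, `step_free`, `run_sequence` and `MAX_BUFS` are hypothetical names chosen for illustration. Each step is allowed only from the state the previous step should have left behind, so freeing a buffer that is still pinned is rejected.

```c
#include <stddef.h>

/* Hypothetical lifecycle state of one buffer, as the application sees it. */
enum buf_state { BUF_NONE, BUF_ALLOCATED, BUF_PINNED };

#define MAX_BUFS 16

static enum buf_state bufs[MAX_BUFS];  /* zero-initialized: all BUF_NONE */

/* Stands in for cudaHostAlloc + cuPointerSetAttribute. */
int step_alloc(size_t i)
{
    if (i >= MAX_BUFS || bufs[i] != BUF_NONE)
        return -1;
    bufs[i] = BUF_ALLOCATED;
    return 0;
}

/* Stands in for the pin_memory IOCTL. */
int step_pin(size_t i)
{
    if (i >= MAX_BUFS || bufs[i] != BUF_ALLOCATED)
        return -1;
    bufs[i] = BUF_PINNED;
    return 0;
}

/* Stands in for the unpin_memory IOCTL. */
int step_unpin(size_t i)
{
    if (i >= MAX_BUFS || bufs[i] != BUF_PINNED)
        return -1;
    bufs[i] = BUF_ALLOCATED;
    return 0;
}

/* Stands in for cudaFreeHost; refused while the buffer is still pinned. */
int step_free(size_t i)
{
    if (i >= MAX_BUFS || bufs[i] != BUF_ALLOCATED)
        return -1;
    bufs[i] = BUF_NONE;
    return 0;
}

/* Runs alloc/pin for n buffers, then unpin/free; 0 if every step was legal. */
int run_sequence(size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (step_alloc(i) != 0 || step_pin(i) != 0)
            return -1;
    for (i = 0; i < n; i++)
        if (step_unpin(i) != 0 || step_free(i) != 0)
            return -1;
    return 0;
}
```

For example, `run_sequence(10)` returns 0 in this model, while `step_free` on a buffer that is still pinned returns -1. The real calls would replace the state updates, and the open fd and close fd steps would wrap the whole loop.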

Given that structure, here is my debug information:

  1. When working with 1 or 2 buffers of 1GB+, the application runs as expected.
  2. When working with more than two buffers, the kernel gets stuck; the log/serial debug output is attached (6.6 KB).
  3. If I remove only the cudaFreeHost call, I can work with 2+ buffers, but I assume skipping it is bad practice, correct? Also, if I remove the pin/unpin calls and keep only cudaHostAlloc and cudaFreeHost, the kernel doesn’t crash.

What am I doing wrong? Is my flow correct? How can I work with 2+ buffers without issues while still freeing the allocated resources?

Environment: Jetson AGX Orin | L4T 36.3 | nv_peer_mem module (deprecated) | CUDA 12.2

Hi,

Have you checked our sample for Jetson at the link below?

If the sample doesn’t help, could you share your source code with us so we can know more about your implementation?

Thanks.

Hi!

Yes, I have checked it. My code is basically a condensed copy of picoevb-rdma. The driver runs as expected when I’m working with a single buffer (any size), and so does the example. But if I try several buffers (like 10 buffers of 1KB+), the kernel crashes; the log is attached.

Do you know if the driver was tested with multiple buffers? The error comes from nvidia_p2p_put_pages/mmu_notifier.c:805. And yes, the sequence (get/put pages, map/unmap) is the same as in picoevb-rdma.

Hi,

We need to check this with our internal team.
Will get back to you later.

Thanks.

Hi,

We tested 16 buffers of 256K each, and RDMA works correctly.
Could you double-check on your side?

If the issue persists, could you share your code snippet with us?
Thanks.