Considering the code structure, here is my debugging information:
When I work with one or two buffers of 1 GB or more, the application runs as expected.
When I work with more than two buffers, the kernel hangs; the log/serial debug output is attached (6.6 KB).
If I remove only the cudaFreeHost call, I can work with more than two buffers, but I assume leaving the allocations unfreed is bad practice, correct? Also, if I remove the pin/unpin steps and keep only cudaHostAlloc and cudaFreeHost, the kernel doesn’t crash.
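For reference, here is a simplified model of the per-buffer order described above, assuming unpin happens before cudaFreeHost: allocate every buffer, pin each one with its own handle, unpin each one, and only then free. It is plain C with hypothetical stub functions (model_host_alloc, model_pin, model_unpin, model_host_free) standing in for cudaHostAlloc, the driver's pin/unpin ioctls and cudaFreeHost. It does not call CUDA or the driver; it only tracks the state of each buffer and rejects out-of-order calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified user-space model of the per-buffer lifecycle.
 * HYPOTHETICAL stubs, not real APIs:
 *   model_host_alloc() stands in for cudaHostAlloc()
 *   model_pin()        stands in for the pin ioctl (nvidia_p2p_get_pages in the kernel)
 *   model_unpin()      stands in for the unpin ioctl (nvidia_p2p_put_pages in the kernel)
 *   model_host_free()  stands in for cudaFreeHost()
 */

enum buf_state { BUF_FREED = 0, BUF_ALLOCATED, BUF_PINNED };

struct buffer {
    void *ptr;
    size_t size;
    int handle;            /* per-buffer pin handle; 0 means "not pinned" */
    enum buf_state state;
};

static int next_handle = 1;

static int model_host_alloc(struct buffer *b, size_t size) {
    if (b->state != BUF_FREED) return -1;
    b->ptr = malloc(size);
    if (b->ptr == NULL) return -1;
    b->size = size;
    b->state = BUF_ALLOCATED;
    return 0;
}

static int model_pin(struct buffer *b) {
    if (b->state != BUF_ALLOCATED) return -1;
    b->handle = next_handle++;   /* each buffer keeps its own handle */
    b->state = BUF_PINNED;
    return 0;
}

static int model_unpin(struct buffer *b) {
    if (b->state != BUF_PINNED) return -1;
    b->handle = 0;
    b->state = BUF_ALLOCATED;
    return 0;
}

static int model_host_free(struct buffer *b) {
    /* Only allocated, unpinned buffers can be freed: a pinned buffer
     * must be unpinned first, and a freed buffer cannot be freed twice. */
    if (b->state != BUF_ALLOCATED) return -1;
    free(b->ptr);
    b->ptr = NULL;
    b->state = BUF_FREED;
    return 0;
}

/* Allocate and pin all buffers, then unpin and free them in reverse order. */
int run_lifecycle(struct buffer *bufs, int n, size_t size) {
    assert(bufs != NULL && n > 0);
    for (int i = 0; i < n; i++)
        if (model_host_alloc(&bufs[i], size) != 0) return -1;
    for (int i = 0; i < n; i++)
        if (model_pin(&bufs[i]) != 0) return -1;
    /* ... DMA transfers on each pinned buffer would happen here ... */
    for (int i = n - 1; i >= 0; i--)
        if (model_unpin(&bufs[i]) != 0) return -1;
    for (int i = n - 1; i >= 0; i--)
        if (model_host_free(&bufs[i]) != 0) return -1;
    return 0;
}
```

Keeping a separate handle per buffer is the point of the model: assuming the driver hands back one page-table handle per pin, each unpin should release exactly the page table its own pin created, never one shared across buffers.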
What am I doing wrong? Is my flow correct? How can I work with more than two buffers without issues while still freeing the allocated resources?
Environment: Jetson AGX Orin | L4T 36.3 | nv_peer_mem module (deprecated) | CUDA 12.2
Yes, I had checked. My code is basically a condensed copy of picoevb-rdma. The driver runs as expected when I work with a single buffer (of any size), and so does the example. But if I try several buffers (e.g., 10 buffers of 1 KB or more), the kernel crashes; the log is attached.
Do you know whether the driver was tested with multiple buffers? The error originates in nvidia_p2p_put_pages (mmu_notifier.c:805). And yes, the sequence (get/put pages, map/unmap) is the same as in picoevb-rdma.