I am developing a project that applies GPUDirect RDMA technology.
My system is:
FPGA: XCKCU15P
GPU: NVIDIA A100
Server: Lenovo SR658
OS: CentOS
CUDA: 11.2
When the FPGA sends a read request to the GPU through XDMA, the PCIe bridge (c9:02.0) immediately replies with an error, which is reported as UR, CA, CSR.
The RDMA driver is based on “jetson-rdma-picoevb-master” and was built with the “build-for-pc-native.sh” script. The requested address is 0xCC000600000, which lies within the GPU's BAR1 space.
Why does this happen? Is it a chipset issue or a driver issue? I believe the driver is correct: it obtains the address 0xCC000600000, which is inside the GPU BAR space, by calling nvidia_p2p_get_pages and nvidia_p2p_dma_map_pages.
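As a sanity check on the claim that the address falls inside a GPU BAR, here is a minimal Python sketch that parses the text of a PCI device's sysfs "resource" file (one "start end flags" line per region, in hex) and reports which region contains a given address. The sample ranges and flag values in SAMPLE_RESOURCE are made up for illustration and are not taken from this system; only the address 0xCC000600000 comes from the post. The helper names parse_resource and find_region are hypothetical.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int, int]  # (start, end, flags)


def parse_resource(text: str) -> List[Region]:
    """Parse the contents of /sys/bus/pci/devices/<BDF>/resource."""
    regions: List[Region] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, flags = (int(f, 16) for f in fields)
        regions.append((start, end, flags))
    return regions


def find_region(regions: List[Region], addr: int) -> Optional[int]:
    """Return the index of the region containing addr, or None."""
    for index, (start, end, _flags) in enumerate(regions):
        if start == 0 and end == 0:
            continue  # unused region
        if start <= addr <= end:
            return index
    return None


# Hypothetical sample data (NOT from the real machine): region 1 is an
# assumed 64 GiB window starting at 0xCC000000000.
SAMPLE_RESOURCE = """\
0x00000000e0000000 0x00000000e0ffffff 0x0000000000040200
0x00000cc000000000 0x00000ccfffffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

if __name__ == "__main__":
    regions = parse_resource(SAMPLE_RESOURCE)
    print(find_region(regions, 0xCC000600000))
```

On the real machine, the input would be read from the GPU's own "resource" file instead of the sample string. This only confirms that the address is within a region the GPU exposes; it does not show whether the bridge at c9:02.0 will forward peer-to-peer requests to it.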
Can anyone help me? I would greatly appreciate it and look forward to your reply.
Experts, please help. The project is somewhat urgent. If the chipset is the problem, we will consider replacing the server. Looking forward to your reply.
I don’t think this forum can give you an answer based on such a description alone. If you want to build a working P2P RDMA driver for the GPU, you need to work with an NVIDIA GPU expert.
There is also another option:
the NVIDIA open-source driver now supports Linux kernel DMA-BUF, so you can use kernel DMA-BUF to access GPU memory.
Thanks for your reply.
My project needs to implement GPUDirect RDMA.
What additional information do I need to provide to diagnose this issue? Which forum should I post this in?
I am a newbie, please help me.
Thank you!