FPGA cannot communicate with A100 through XDMA Using RDMA

Hi, all:

I am developing a project that applies GPUDirect RDMA technology.
My system is:
FPGA: XCKCU15P
GPU: NVIDIA A100
Server: Lenovo SR658
System: CentOS
CUDA: 11.2


FPGA and A100 are mounted on the same PCIe bridge.
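
One way to double-check this placement is to compare the resolved sysfs paths of the two devices. The sketch below is only an illustration: the two example paths are hypothetical (only the bridge address c9:02.0 comes from this post), and `resolve` assumes a Linux host with sysfs mounted at /sys.

```python
import os
import re

# A PCI function address such as 0000:c9:02.0, and a host bridge entry such as pci0000:c0.
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")
ROOT = re.compile(r"^pci[0-9a-f]{4}:[0-9a-f]{2}$")


def pci_chain(resolved_path):
    """Return the PCI components of a resolved sysfs device path, root first."""
    return [p for p in resolved_path.split("/") if BDF.match(p) or ROOT.match(p)]


def common_upstream(path_a, path_b):
    """Return the deepest component shared by both device paths, or None."""
    shared = None
    for a, b in zip(pci_chain(path_a), pci_chain(path_b)):
        if a != b:
            break
        shared = a
    return shared


def resolve(bdf):
    """Resolve a device address to its full sysfs path (needs a real Linux host)."""
    return os.path.realpath(f"/sys/bus/pci/devices/{bdf}")


# Hypothetical paths for illustration only; on a real system use resolve("<bdf>").
FPGA_PATH = "/sys/devices/pci0000:c0/0000:c0:01.0/0000:c8:00.0/0000:c9:01.0/0000:ca:00.0"
GPU_PATH = "/sys/devices/pci0000:c0/0000:c0:01.0/0000:c8:00.0/0000:c9:02.0/0000:cb:00.0"

if __name__ == "__main__":
    print(common_upstream(FPGA_PATH, GPU_PATH))
```

If the result is a switch port, the two devices sit below the same switch. If it is only the `pci…` host bridge entry, traffic between them has to cross the root complex.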

When the FPGA sends a read request to the GPU through XDMA, the PCIe bridge (c9:02.0) immediately replies with an error, which is displayed as UR, CA, CSR.

The RDMA driver is based on “jetson-rdma-picoevb-master” and was built with the script “build-for-pc-native.sh”. The requested address is 0xCC000600000, which lies within the BAR1 space of the GPU.


The bridge's address window also covers this range.
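
For reference, this containment check can be scripted against the sysfs `resource` file of the GPU and of the bridge, where each line holds a start address, end address and flags in hex. This is a minimal sketch: the sample text below is hypothetical (including the 64 GiB aperture size and the flag values), and only the address 0xCC000600000 is taken from this post.

```python
def parse_resource(text):
    """Parse sysfs 'resource' content into (index, start, end), skipping unused entries."""
    ranges = []
    for idx, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end = int(fields[0], 16), int(fields[1], 16)
        if start == 0 and end == 0:
            continue  # unused resource slot
        ranges.append((idx, start, end))
    return ranges


def containing(ranges, addr, length=1):
    """Return the ranges that fully contain [addr, addr + length - 1]."""
    last = addr + length - 1
    return [(i, s, e) for i, s, e in ranges if s <= addr and last <= e]


# Hypothetical sample of a GPU 'resource' file, for illustration only.
SAMPLE_GPU_RESOURCE = (
    "0x00000000f0000000 0x00000000f0ffffff 0x0000000000140204\n"
    "0x00000cc000000000 0x00000ccfffffffff 0x000000000014220c\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
)

TARGET = 0xCC000600000  # address requested by the FPGA, from the post

if __name__ == "__main__":
    for idx, start, end in containing(parse_resource(SAMPLE_GPU_RESOURCE), TARGET):
        print(f"resource {idx}: {start:#x}-{end:#x}")
```

On a real system the text would come from `/sys/bus/pci/devices/<bdf>/resource` for each device. Note that an address falling inside both ranges only shows that the address decode is plausible; it does not by itself show that the bridge forwards peer-to-peer requests.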

Why does this happen? Is it a chipset issue or a driver issue? I believe the driver obtains the GPU address 0xCC000600000 correctly, within the GPU BAR space, by calling nvidia_p2p_get_pages and nvidia_p2p_dma_map_pages.
Can anyone help me? I would greatly appreciate it and look forward to your reply.

Thanks,
Yours Yang

The RDMA doc shows:


How can I check whether GPUDirect RDMA can be performed between two devices?
nvidia-smi topo -p2p w
Or is there another way?

Please help, experts. The project is somewhat urgent. If it is a chipset problem, we will consider replacing the server. Looking forward to your reply.

I don’t think this forum can give you an answer from such a description alone. If you want to build a working P2P RDMA driver for the GPU, you need to work with an NVIDIA GPU expert.

There is also another option:

The NVIDIA open-source driver now supports Linux kernel DMA-BUF, so you can use kernel DMA-BUF to access GPU memory.

https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html

Thanks for your reply.
My project needs to implement GPUDirect RDMA.
What additional information do I need to provide to diagnose this issue? Which forum should I post it in?
I am a newbie, please help me.
Thank you!
