We have our FPGA PCIe endpoint connected to a Jetson AGX Orin, and with little drama we see the full expected bandwidth of the PCIe bus between the FPGA and the Orin’s shared memory. Our starting point is a DPDK application moving data between the FPGA and Linux hugepages in Arm user space. Over the Gen4 x8 link this gives nearly 12.5 GB/s in each direction, full duplex. This is our functional baseline.
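For context, the baseline buffer setup is essentially the sketch below (heavily simplified; fpga_program_descriptor() stands in for our own driver call and is purely illustrative):

```
/* Simplified sketch of our current baseline (error handling trimmed):
 * reserve an IOVA-contiguous, hugepage-backed buffer and hand its bus
 * address to the FPGA DMA engine. fpga_program_descriptor() is our own
 * call and only illustrative. */
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_memzone.h>

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    const struct rte_memzone *mz = rte_memzone_reserve(
        "fpga_dma_buf", 2 * 1024 * 1024, rte_socket_id(),
        RTE_MEMZONE_IOVA_CONTIG);
    if (mz == NULL)
        return -1;

    /* mz->addr : CPU virtual address used by the application
     * mz->iova : bus address programmed into the FPGA descriptors */
    /* fpga_program_descriptor(mz->iova, mz->len); */

    return 0;
}
```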
We understand that GPUDirect RDMA has historically been a peer-to-peer construct between two discrete PCIe endpoints. Can someone please confirm (or contradict) that it is equally at home when an endpoint device needs to move data to/from a region of the Orin’s shared memory that is directly accessible by its CUDA cores?
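If it is, our guess at the user-space side is the sketch below: allocate with cudaMalloc(), set the sync-memops attribute that the GPUDirect RDMA documentation recommends, and hand the pointer to our own (not yet written) FPGA driver to be pinned. Please tell us if this is the wrong mental model for the integrated GPU.

```
/* Rough sketch of what we *think* the user-space side looks like.
 * fpga_pin_gpu_buffer() is a placeholder for an ioctl into our own
 * FPGA kernel driver -- it does not exist yet. */
#include <cuda.h>          /* driver API: cuPointerSetAttribute() */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int dev = 0, rdma_ok = 0;
    /* We are not sure what this attribute reports on Tegra. */
    cudaDeviceGetAttribute(&rdma_ok, cudaDevAttrGPUDirectRDMASupported, dev);
    printf("GPUDirectRDMASupported = %d\n", rdma_ok);

    void *gpu_buf = NULL;
    size_t len = 2 * 1024 * 1024;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return -1;

    /* Recommended in the GPUDirect RDMA docs: keep CUDA memory ops
     * synchronous with respect to third-party DMA to this allocation. */
    unsigned int flag = 1;
    cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                          (CUdeviceptr)gpu_buf);

    /* Hypothetical: our driver would pin this range and return the bus
     * addresses for the FPGA's DMA descriptors. */
    /* fpga_pin_gpu_buffer(gpu_buf, len); */

    cudaFree(gpu_buf);
    return 0;
}
```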
We now want to move data directly between the FPGA and GPU memory buffers, so that no copy is required. It feels like GPUDirect RDMA is the “NVIDIA” way to accomplish this. While we’ve read in this forum of it being done with FPGAs, the details are sketchy. We have carefully read this document:
It doesn’t explain the sequence of PCIe TLPs needed for the RDMA messages and doorbells. Is there an NVIDIA document that details this? Or does GPUDirect RDMA rely on some other normative spec, the way RoCE relies on InfiniBand? Any pointers are greatly appreciated.
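For what it’s worth, our working guess, which we’d be glad to have confirmed or corrected, is that GPUDirect RDMA defines no messages or doorbells at all: once a CUDA buffer is pinned and its bus addresses are handed to the FPGA, the FPGA simply issues ordinary PCIe memory write/read TLPs, and any doorbell is a convention we invent ourselves. The sketch below (entirely our own construction, with the host standing in for the FPGA) shows the kind of flag-polling doorbell we have in mind:

```
/* Entirely our own construction -- nothing below comes from an NVIDIA
 * API. The FPGA writes a payload into the pinned buffer and then,
 * relying on PCIe posted-write ordering from a single requester, writes
 * an incremented sequence number into a flag word that a CUDA kernel
 * polls. Here the host stands in for the FPGA so the sketch is
 * self-contained. */
#include <cuda_runtime.h>

__global__ void wait_for_fpga(volatile unsigned int *seq_flag,
                              unsigned int expected)
{
    /* Spin until the FPGA's final MWr (the "doorbell") lands. */
    while (*seq_flag != expected)
        ;
    /* Keep payload reads from being reordered ahead of the flag check;
     * a production design would use proper acquire/release semantics. */
    __threadfence_system();
    /* ... consume the payload the FPGA wrote before ringing the bell ... */
}

int main(void)
{
    /* Zero-copy host allocation used only to simulate the FPGA write;
     * in the real system the flag would live inside the pinned,
     * FPGA-visible buffer. */
    volatile unsigned int *flag = NULL;
    cudaHostAlloc((void **)&flag, sizeof(*flag), cudaHostAllocMapped);
    *flag = 0;

    wait_for_fpga<<<1, 1>>>(flag, 1u);
    *flag = 1;                      /* pretend the FPGA rang the doorbell */
    cudaDeviceSynchronize();

    cudaFreeHost((void *)flag);
    return 0;
}
```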