I have a problem with the transfer rate between FPGA-GPU when using GPUDirect RDMA.
Here is my system:
- FPGA: Virtex UltraScale+ HBM VCU128 FPGA Evaluation Kit. Using XDMA IP with Descriptor Bypass enabled and PCIe Gen3 x16
- GPU: NVIDIA A100
- Server: Super Micro A+ 4124GS-TNR
The FPGA design has been verified with a DMA transfer rate of ~ 12.5GB/s for both CPU-FPGA Read and Write.
Then I tried to transfer data between the GPU and the FPGA using the GPUDirect RDMA API (see "GPUDirect RDMA" in the CUDA Toolkit Documentation on nvidia.com). I used the “nvtop” tool to observe the transfer rate between the GPU and the FPGA. Here are the results:
- FPGA to GPU: The FPGA2GPU rate is stable at ~13.5GB/s and there are no read transactions while writing
- GPU to FPGA: The GPU2FPGA rate is low at ~8GB/s and there are some write transactions (~500MB/s) while reading
- Both in parallel: The GPU2FPGA rate is extremely low (< 1GB/s) while the FPGA2GPU rate is very high (>13GB/s)
My question is whether this difference between the read and write rates in RDMA is normal, or whether there is something wrong with my system. What could be the root cause of this problem?
I’m a newbie in RDMA. Any comments and suggestions will be appreciated.
Screen capture when running both FPGA2GPU and GPU2FPGA in parallel
Thanks a lot,
I recommend contacting the developer of the GPUDirect RDMA driver for your device for support.
The practically achievable PCIe bandwidth of GPUs is pretty much identical in both directions, and since the PCIe interconnect is full duplex, this can also be sustained for simultaneous transfers in both directions. 13 GB/sec is at the higher end of what is expected for a PCIe gen3 x16 or PCIe gen4 x8 configuration. The kind of performance drop seen in these experiments indicates that the FPGA cannot sink the data transferred across PCIe fast enough. I am not familiar with nvtop and would suggest measuring bandwidth across PCIe using your own program so you have full control over what is being measured.
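As a rough sanity check on the 13 GB/sec figure, the raw link rate can be worked out from the PCIe signaling rate, the lane count, and the 128b/130b line encoding. This is only a minimal sketch: the function name is made up for illustration, and the result is the raw rate before TLP/DLLP protocol overhead, so the practically achievable throughput is lower.

```python
def pcie_raw_gb_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw PCIe Gen3/Gen4 link rate in GB/s, before packet overhead."""
    encoding = 128 / 130  # 128b/130b line encoding used by Gen3 and later
    return gt_per_s * lanes * encoding / 8  # 8 bits per byte


gen3_x16 = pcie_raw_gb_per_s(8.0, 16)   # 8 GT/s per lane, 16 lanes
gen4_x8 = pcie_raw_gb_per_s(16.0, 8)    # 16 GT/s per lane, 8 lanes

print(f"Gen3 x16: {gen3_x16:.2f} GB/s, Gen4 x8: {gen4_x8:.2f} GB/s")
print(f"13 GB/s is {13 / gen3_x16:.0%} of the raw Gen3 x16 rate")
```

Both configurations come out at roughly 15.75 GB/sec raw, so 13 GB/sec corresponds to a little over 80% link efficiency.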
Given that the performance of FPGA memory interfaces traditionally has been a weak spot, the observations did not appear surprising to me at first, except that according to the vendor’s hardware overview, this particular FPGA comes with 8 GB of HBM memory providing bandwidth of up to 460 GB/sec precisely to eliminate this bottleneck. There may be an FPGA memory controller or PCIe interface configuration issue in play, or an issue with the GPUDirect driver for this device. Nothing points at the GPU as the source of the issue. Check the documentation for this FPGA and its associated devkit. If that does not provide any clues, I would suggest contacting the FPGA vendor (Xilinx).
Thanks for your advice.
By observing the DMA transactions on the FPGA with the ILA tool, I’ve figured out the problem.
The GPU2FPGA transmission can be divided into the following steps:
- At T = 100, the FPGA sends 32 read requests to the GPU to read data (which accounts for the write transactions seen while reading)
- At T = 633, the first GPU packet arrives at the FPGA (high latency compared to CPU-FPGA, where the first packet takes only 200 clock cycles and arrives at T = 300)
- At T = 686, the FPGA sends the 33rd read request, only after receiving some GPU packets (probably because of the Xilinx XDMA IP mechanism)
→ Due to this DMA mechanism and the high latency, the GPU2FPGA rate is low at ~8GB/s
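The effect of the steps above can be estimated with a simple latency-bound model: with a fixed number of outstanding read requests, throughput cannot exceed the bytes in flight divided by the round-trip latency. The sketch below is only an illustration. The 512-byte read request size and the 250 MHz clock are assumptions that were not measured in this thread, and the function name is made up. Only the 32 outstanding requests, the cycle counts, and the ~12.5GB/s link rate come from the observations above.

```python
def read_rate_gb_per_s(outstanding: int, request_bytes: int,
                       latency_cycles: int, clock_hz: float,
                       link_limit_gb_per_s: float) -> float:
    """Read throughput bounded by bytes in flight per round trip, capped by the link."""
    bytes_in_flight = outstanding * request_bytes
    latency_s = latency_cycles / clock_hz
    latency_bound = bytes_in_flight / latency_s / 1e9
    return min(latency_bound, link_limit_gb_per_s)


REQUEST_BYTES = 512   # ASSUMPTION: max read request size, not confirmed here
CLOCK_HZ = 250e6      # ASSUMPTION: ILA sample clock, not confirmed here
LINK_LIMIT = 12.5     # GB/s, the verified CPU-FPGA DMA rate

# GPU: 32 requests issued at T = 100, first data at T = 633
gpu_rate = read_rate_gb_per_s(32, REQUEST_BYTES, 633 - 100, CLOCK_HZ, LINK_LIMIT)
# CPU: first data after about 200 cycles
cpu_rate = read_rate_gb_per_s(32, REQUEST_BYTES, 200, CLOCK_HZ, LINK_LIMIT)

print(f"GPU2FPGA estimate: {gpu_rate:.2f} GB/s")
print(f"CPU2FPGA estimate: {cpu_rate:.2f} GB/s")
```

Under these assumed values the GPU case is latency-bound at roughly 7.7GB/s, while the CPU case has enough requests in flight to reach the link limit. If the real request size or clock differs, the numbers change accordingly.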
For FPGA2GPU transmission, the FPGA continuously sends write requests and write data without waiting for any return or acknowledgement packets. Thus, the FPGA2GPU rate is very high at ~13GB/s.
I think this problem is caused by the long path between the GPU and the FPGA. I’m going to try FPGA-CPU DMA instead, which has been verified with a transfer rate of ~12.5GB/s for both read and write, and then use cudaMemcpyAsync() to copy the data to the GPU for processing. Hope it works!
Captured waveform for FPGA2CPU transactions with low latency and high data rate.