GPU2FPGA transfer rate is lower than FPGA2GPU when using GPUDirect RDMA

haivp · May 25, 2022, 11:06am

Hi,

I have a problem with the transfer rate between the FPGA and the GPU when using GPUDirect RDMA.
Here is my system:

FPGA: Virtex UltraScale+ HBM VCU128 FPGA Evaluation Kit. Using XDMA IP with Descriptor Bypass enabled and PCIe Gen3 x16
GPU: NVIDIA A100
Server: Super Micro A+ 4124GS-TNR

The FPGA design has been verified with a DMA transfer rate of ~ 12.5GB/s for both CPU-FPGA Read and Write.
Then I tried to transfer data between the GPU and the FPGA using the GPUDirect RDMA API (GPUDirect RDMA :: CUDA Toolkit Documentation, nvidia.com). I used the “nvtop” tool to observe the transfer rate between the GPU and the FPGA. Here are the results:

FPGA to GPU: The FPGA2GPU rate is stable at ~13.5GB/s and there is no read transaction while writing
GPU to FPGA: The GPU2FPGA rate is low at ~8GB/s and there are some write transactions (~500MB/s) while reading

Both in parallel: The GPU2FPGA rate is extremely low (< 1GB/s) while the FPGA2GPU rate is very high (>13GB/s)

My question is whether this difference between the read and write rates is normal for RDMA, or whether something is wrong with my system. What could be the root cause of this problem?

I’m a newbie in RDMA. Any comments and suggestions will be appreciated.

Thanks,
Hai Van

haivp · May 25, 2022, 11:13am

Screen capture when running both FPGA2GPU and GPU2FPGA in parallel

Thanks a lot,
Hai Van

Robert_Crovella · May 25, 2022, 1:22pm

I recommend contacting the developer of the GPUDirect RDMA driver for your device for support.

njuffa · May 25, 2022, 4:11pm

The practically achievable PCIe bandwidth of GPUs is pretty much identical in both directions, and since the PCIe interconnect is full duplex, this can also be sustained for simultaneous transfers in both directions. 13 GB/sec is at the higher end of what is expected for a PCIe gen3 x16 or PCIe gen4 x8 configuration. The kind of performance drop seen in these experiments indicates that the FPGA cannot sink the data transferred across PCIe fast enough. I am not familiar with nvtop and would suggest measuring bandwidth across PCIe using your own program so you have full control over what is being measured.
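As a rough cross-check of the "higher end of what is expected" remark, the raw link ceiling can be worked out from the published PCIe signaling rates (8 GT/s per lane for gen3, 16 GT/s for gen4, both with 128b/130b encoding). This is only a sketch: it ignores packet header and flow-control overhead, which is why practical rates land below the result. The function name is made up for illustration.

```python
def pcie_raw_ceiling_gbps(gt_per_s, lanes, enc_num=128, enc_den=130):
    """Link bandwidth in GB/s after line encoding, before TLP/DLLP overhead."""
    bits_per_s = gt_per_s * 1e9 * lanes * enc_num / enc_den
    return bits_per_s / 8 / 1e9

gen3_x16 = pcie_raw_ceiling_gbps(8.0, 16)   # PCIe gen3, 16 lanes
gen4_x8 = pcie_raw_ceiling_gbps(16.0, 8)    # PCIe gen4, 8 lanes

# Fraction of the raw ceiling reached by the ~13 GB/sec mentioned above.
efficiency = 13.0 / gen3_x16

print(f"gen3 x16 ceiling: {gen3_x16:.2f} GB/s")
print(f"gen4 x8 ceiling:  {gen4_x8:.2f} GB/s")
print(f"13 GB/s is {efficiency:.0%} of the gen3 x16 ceiling")
```

Both configurations give the same raw ceiling of a little under 16 GB/sec, so a sustained 13 GB/sec leaves little headroom once packet overhead is accounted for.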

Given that the performance of FPGA memory interfaces traditionally has been a weak spot, the observations did not appear surprising to me at first, except that according to the vendor’s hardware overview, this particular FPGA comes with 8 GB of HBM memory providing bandwidth of up to 460 GB/sec precisely to eliminate this bottleneck. There may be an FPGA memory controller or PCIe interface configuration issue in play, or an issue with the GPUDirect driver for this device. Nothing points at the GPU as the source of the issue. Check the documentation for this FPGA and its associated devkit. If that does not provide any clues, I would suggest contacting the FPGA vendor (Xilinx).

haivp · May 27, 2022, 5:01pm

Hi njuffa,

Thanks for your advice.

By observing the DMA transactions on the FPGA (with the ILA tool), I’ve figured out the problem.

The GPU2FPGA transmission can be divided into the following steps:

At T = 100, the FPGA sends 32 read requests to the GPU to read data (which accounts for some write transactions while reading)
At T = 633, the first GPU packet arrives at the FPGA (high latency compared to CPU-FPGA, where the first packet takes only 200 clock cycles and arrives at T = 300)
At T = 686, FPGA continues to send 33rd read request after receiving some GPU packets (probably because of Xilinx XDMA IP mechanism)
→ Due to the DMA mechanism and high latency, GPU2FPGA rate is low at ~8GB/s
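The steps above amount to a latency-bound read window: with a fixed number of read requests in flight, throughput is at most (bytes in flight) / (round-trip latency). A minimal sketch of that arithmetic follows. Only the 32 outstanding requests and the cycle counts (633 - 100 = 533 for the GPU, 200 for the CPU) come from the observations above. The 512-byte read request size, the 250 MHz clock for the T counter, and the 13 GB/s practical link cap are assumptions chosen for illustration and should be replaced with the values from the actual design.

```python
def read_rate_gbps(outstanding, request_bytes, latency_cycles,
                   clock_mhz, link_cap_gbps):
    """Upper bound on read throughput with a fixed window of requests."""
    latency_ns = latency_cycles * 1000.0 / clock_mhz
    window_rate = outstanding * request_bytes / latency_ns  # bytes/ns == GB/s
    return min(window_rate, link_cap_gbps)

OUTSTANDING = 32        # from the ILA capture described above
REQUEST_BYTES = 512     # ASSUMPTION: max read request size
CLOCK_MHZ = 250.0       # ASSUMPTION: clock behind the T counter
LINK_CAP_GBPS = 13.0    # ASSUMPTION: practical link ceiling

gpu_rate = read_rate_gbps(OUTSTANDING, REQUEST_BYTES, 633 - 100,
                          CLOCK_MHZ, LINK_CAP_GBPS)
cpu_rate = read_rate_gbps(OUTSTANDING, REQUEST_BYTES, 200,
                          CLOCK_MHZ, LINK_CAP_GBPS)

print(f"GPU -> FPGA read bound: {gpu_rate:.2f} GB/s")
print(f"CPU -> FPGA read bound: {cpu_rate:.2f} GB/s")
```

Under these assumed parameters the GPU case is limited by the request window while the CPU case is limited by the link. The match with the observed rates may be partly coincidental, but the model shows why more outstanding requests or larger requests would be needed to hide the higher GPU latency.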

For FPGA2GPU transmission, the FPGA continuously sends write requests and write data without waiting for any return or acknowledgement packets. Thus, the FPGA2GPU rate is very high at ~13GB/s.

I think this problem is caused by the long path between the GPU and the FPGA. I’m going to try FPGA-CPU DMA instead, which has been verified with a transfer rate of ~12.5GB/s for both read and write, and then use cudaMemcpyAsync() to copy the data to the GPU for processing. Hope it works!
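Whether staging through host memory beats the direct path depends on whether the two hops overlap. A rough sketch of the end-to-end rate: the 12.5 GB/s FPGA-CPU figure is the one quoted above, while the 13 GB/s host-GPU figure is an assumption, and the function name is hypothetical.

```python
def staged_rate_gbps(hop1_gbps, hop2_gbps, overlapped):
    """Steady-state end-to-end rate of a two-hop staged transfer."""
    if overlapped:
        # Chunked and double-buffered: the slower hop sets the pace.
        return min(hop1_gbps, hop2_gbps)
    # Each byte crosses hop 1, then hop 2, with no overlap between them.
    return 1.0 / (1.0 / hop1_gbps + 1.0 / hop2_gbps)

FPGA_CPU_GBPS = 12.5   # from the measurements quoted above
CPU_GPU_GBPS = 13.0    # ASSUMPTION: practical host <-> GPU copy rate

serial = staged_rate_gbps(FPGA_CPU_GBPS, CPU_GPU_GBPS, overlapped=False)
pipelined = staged_rate_gbps(FPGA_CPU_GBPS, CPU_GPU_GBPS, overlapped=True)

print(f"back-to-back hops: {serial:.2f} GB/s")
print(f"overlapped hops:   {pipelined:.2f} GB/s")
```

Under these assumptions, running the hops back to back would come out below the ~8GB/s direct path, so the staged approach only pays off if the data is split into chunks and the FPGA DMA for one chunk overlaps with the asynchronous GPU copy of the previous one.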

Thanks,
Hai Van

haivp · May 27, 2022, 5:03pm

Captured waveform for FPGA2CPU transactions with low latency and high data rate.

Thanks,
Hai Van

