P2P DMA performance limitation where a single CPU is involved?

Please give us any suggestions if you have any hints, information, or ideas.

I’m currently concerned about peer-to-peer DMA performance (throughput) in a uni-processor system with 1x GPU and 3x SSDs installed.

M/B: Supermicro 5018GR-T
CPU: Intel Xeon E5-2650v4 x1
GPU: NVIDIA Tesla P40 x1
SSD: Intel DC P4600(2.0TB) x3

This system has 1x PCIe x16 slot + 3x PCIe x8 slots; all of them are directly connected to the CPU.
Each SSD (DC P4600) is rated at 3.2GB/s sequential read, so 9.6GB/s is the expected hardware limit when we construct an md-raid0 (striping) volume across the three SSDs. In fact, the SSD-to-RAM test recorded about 9.4-9.5GB/s.
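For reference, the raw link numbers work out as follows (these are standard PCIe Gen3 figures, not measured on this box): each x8 slot tops out near 7.9GB/s and the x16 slot near 15.75GB/s, so the 9.6GB/s aggregate is comfortably inside the GPU-side x16 link. A minimal sketch of the arithmetic:

```c
#include <assert.h>

/* Usable PCIe Gen3 bandwidth: 8 GT/s per lane with 128b/130b encoding.
 * Returns GB/s for a link of the given width (protocol overhead such as
 * TLP headers is ignored, so real throughput is somewhat lower). */
double pcie_gen3_link_gbs(int lanes)
{
    double per_lane = 8.0 * (128.0 / 130.0) / 8.0;  /* ~0.985 GB/s/lane */
    return per_lane * lanes;
}
```

By this arithmetic, 3 x 3.2GB/s = 9.6GB/s is below the ~15.75GB/s of the x16 link, so the GPU-side link width alone cannot explain a 7.1GB/s ceiling.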

On the other hand, I could not push SSD-to-GPU P2P DMA performance above 7.1GB/s.

NVIDIA documentation for GPUDirect RDMA says:

  • PCIe switches only
  • single CPU/IOH
  • CPU/IOH <-> QPI/HT <-> CPU/IOH

The first situation, where there are only PCIe switches on the path, is optimal and yields the best performance. The second one, where a single CPU/IOH is involved, works, but yields worse performance (especially peer-to-peer read bandwidth has been shown to be severely limited on some processor architectures).

Do you think 7.1GB/s is adequate performance for peer-to-peer DMA routed through a single CPU?
I don’t have any processor other than the Xeon E5-2650v4, so I cannot compare against other models/generations.

With a dual-SSD configuration, SSD-to-GPU P2P DMA reported 6.3GB/s throughput, and a single-SSD configuration reported 3.2GB/s.
So it seems unlikely that the SSD controller itself is the bottleneck, I think.


Is it certain that RDMA between the SSDs and the GPU is actually taking place? I was under the impression that RDMA with the GPU only takes place if the driver for the device specifically supports it, and only a few device drivers incorporate such support.

Your observations would square with the hypothesis that SSDs and GPU actually communicate through system memory here.

What program was used to measure the throughput numbers? Some sort of standard I/O benchmark, or something written in-house?

Yes, I’m working on both the application and a kernel driver that intermediates data transfers from NVMe SSD to GPU.


What our kernel driver does is quite simple. It constructs an NVMe READ command that reads the specified data blocks on the SSD into GPU device memory that has already been mapped using GPUDirect RDMA. This command structure is pushed to the command queue of the inbox nvme driver.
Eventually, these READ commands are handed to the NVMe device (via its submission queue and doorbell register), and the NVMe SSD’s controller then performs the SSD-to-GPU DMA according to the command.
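As an illustration of that flow, here is a simplified sketch, not the actual driver code: the real driver would use struct nvme_command from &lt;linux/nvme.h&gt;, and the GPU bus address would come from nvidia_p2p_get_pages() plus DMA mapping of the GPU BAR pages. The struct below is an abbreviated stand-in for the 64-byte submission queue entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Abbreviated stand-in for the NVMe submission queue entry; field
 * names follow the NVMe spec, but the layout is simplified here. */
struct nvme_rw_cmd {
    uint8_t  opcode;     /* 0x02 = NVMe READ                          */
    uint16_t command_id;
    uint32_t nsid;       /* namespace to read from                    */
    uint64_t prp1;       /* destination bus address of the 1st page   */
    uint64_t prp2;       /* 2nd page, or pointer to a PRP list        */
    uint64_t slba;       /* starting logical block address on the SSD */
    uint16_t nlb;        /* number of logical blocks, 0-based         */
};

/* Build a READ whose destination is GPU device memory: gpu_bus_addr
 * points into the GPU's PCIe BAR rather than system RAM, so the SSD
 * controller DMAs the data straight to the GPU. */
void setup_ssd2gpu_read(struct nvme_rw_cmd *cmd, uint32_t nsid,
                        uint64_t slba, uint16_t nblocks,
                        uint64_t gpu_bus_addr)
{
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode = 0x02;            /* NVMe READ                     */
    cmd->nsid   = nsid;
    cmd->prp1   = gpu_bus_addr;    /* GPU BAR address, not RAM      */
    cmd->slba   = slba;
    cmd->nlb    = nblocks - 1;     /* NVMe encodes NLB as count - 1 */
}
```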

See utils/ssd2gpu_test and utils/ssd2ram_test in the above repository.
These are thin wrappers around the kernel driver; they set up READ requests against the specified file from multiple threads.
SSD2RAM performs at catalog spec at least, as long as the application side supplies enough READ commands.
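The multithreaded submission pattern looks roughly like this plain pread()-based sketch (for illustration only: the real utilities drive the kernel module rather than issuing ordinary file reads, and parallel_read with its thread/chunk parameters is made up here):

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS   4
#define CHUNK_SIZE (1 << 20)   /* 1MB per READ request */

struct worker_arg {
    int     fd;
    off_t   offset;      /* this thread's starting offset       */
    off_t   stride;      /* distance between consecutive chunks */
    off_t   file_size;
    ssize_t bytes_read;
};

/* Each thread reads an interleaved set of chunks, so several READ
 * requests are always in flight against the device at once. */
static void *reader(void *p)
{
    struct worker_arg *a = p;
    char *buf = malloc(CHUNK_SIZE);
    for (off_t off = a->offset; off < a->file_size; off += a->stride) {
        ssize_t n = pread(a->fd, buf, CHUNK_SIZE, off);
        if (n > 0)
            a->bytes_read += n;
    }
    free(buf);
    return NULL;
}

ssize_t parallel_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t size = lseek(fd, 0, SEEK_END);

    pthread_t th[NTHREADS];
    struct worker_arg args[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        args[i] = (struct worker_arg){
            .fd = fd,
            .offset = (off_t)i * CHUNK_SIZE,
            .stride = (off_t)NTHREADS * CHUNK_SIZE,
            .file_size = size,
        };
        pthread_create(&th[i], NULL, reader, &args[i]);
    }
    ssize_t total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(th[i], NULL);
        total += args[i].bytes_read;
    }
    close(fd);
    return total;
}
```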

So, I suspect a hardware limitation, especially the Broadwell-EP CPU's ability to route P2P DMA packets on the PCIe bus.

Does anyone have any information about this behavior?

If you are crafting your own driver you certainly know more about RDMA than I do. I assume, given the packetized nature of PCIe, that you have already looked into transfer rate as a function of transfer block size? Based on my observations, PCIe transfers to the GPU do not reach maximum throughput until the transfer size reaches about 16 MB.