GPUDirect Write with transfer size under 256 Bytes

Hello all,

I’ve created some custom performance tests for RDMA as well as for GPUDirect, and everything works fine except for one issue:
If the client has allocated GPU memory and transfers a subsection of it smaller than 256 bytes to the remote memory via RDMA Write, the program crashes with a segmentation fault. The same tests with main memory instead of GPU memory work fine, regardless of which memory type the remote side uses, and all tests with transfer sizes of at least 256 bytes succeed. The allocated (and registered) memory block is always several times larger than the transferred section, so an out-of-bounds access shouldn’t be the cause. It also doesn’t matter whether client and server run on the same node (localhost) or on separate hardware.
I pinned the fault down to the ibv_post_send() call; a minimal sketch of the call site follows.
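For reference, the fault occurs in a call site of roughly this shape. The names (qp, mr, remote_addr, rkey) are placeholders for an already-connected RC QP, a registered MR covering the buffer, and the peer’s address/rkey exchanged out of band; this is not my exact test code:

```c
/*
 * Minimal sketch of the failing call site. qp, mr, remote_addr and
 * rkey are placeholders, not the actual test code.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf, /* GPU or host virtual address       */
        .length = len,            /* < 256 with GPU memory => segfault */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The segmentation fault happens inside this call. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```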

Summary:
An RDMA Write of fewer than 256 bytes from client-allocated GPU memory triggers a segmentation fault in ibv_post_send().

Question:
Does anyone have an idea why this behavior occurs and how it can be fixed? Thank you very much in advance!

System:

  • OS: Ubuntu 18.04.1 LTS
  • CPU: Intel® Xeon® Gold 5120 CPU @ 2.20GHz
  • GPUs: 2x Tesla V100-SXM2-16GB (Driver V.418.67, CUDA V.10.1)
  • NICs: 2x Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
    2x Mellanox MT27800 Family [ConnectX-5]
  • PCI: PCIe 3.0 (motherboard), RDMA NICs connected via PCIe Gen3 x16
  • IB: SB7890 InfiniBand EDR 100Gb/s Switch System

I figured it out myself and just want to share the solution here for others.
For GPUDirect to work with message sizes smaller than 256 bytes, I needed to allocate the GPU memory with the CUDA driver API instead of the CUDA runtime API. Have a look at how the IB PerfTool was built for reference; a minimal sketch of the driver-API allocation follows.
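Below is a minimal sketch of what that allocation path looks like, along the lines of what the perftest tools do. Error handling is omitted and device ordinal 0 is assumed; the cuPointerSetAttribute() call with CU_POINTER_ATTRIBUTE_SYNC_MEMOPS follows the GPUDirect RDMA documentation:

```c
/*
 * Minimal sketch of GPU buffer allocation via the CUDA driver API.
 * Error handling is omitted and device ordinal 0 is assumed.
 */
#include <cuda.h>

CUdeviceptr alloc_gpudirect_buf(size_t bytes)
{
    CUdevice     dev;
    CUcontext    ctx;
    CUdeviceptr  d_buf;
    unsigned int sync_flag = 1;

    /* Explicit initialization and context creation: unlike
     * cudaMalloc(), the driver API does not set up a context
     * implicitly. */
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuMemAlloc(&d_buf, bytes);

    /* Recommended by the GPUDirect RDMA docs: force synchronous
     * memory operations on this allocation. */
    cuPointerSetAttribute(&sync_flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                          d_buf);
    return d_buf;
}
```

The returned pointer is then registered as usual, e.g. with ibv_reg_mr(pd, (void *)d_buf, bytes, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE). With this allocation path, RDMA Writes below 256 bytes from GPU memory no longer segfault in my tests.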

Best regards,
Luca