GPUDirect RDMA at the ibverbs level.

The goal is simple: I want to use GPUDirect RDMA at the ibverbs level.

I do not want to use a CUDA-aware MPI implementation because I need a greater level of control. I want to perform transfers from a GPU to a remote host using GPUDirect RDMA.

After rummaging through a couple of scattered posts, my understanding is that I only need to install the nv_peer_mem module. ibverbs can then tell a GPU pointer from a main-memory pointer automatically and handle it accordingly. So I can basically call cudaMalloc(&gpu_ptr, size), register the GPU pointer using ibv_reg_mr, and continue normally through my application.

Is this understanding correct?

My setup if that helps:

I am running an Ubuntu 18.04.4 machine with a Mellanox ConnectX-5 NIC and a V100 GPU. I have OFED 4.6, GPU driver version 455.32, and CUDA version 11.1.

I am able to run CUDA kernels without issue, and I am able to run RDMA using ibverbs without issue. My main goal is to perform RDMA send and receive operations using GPU memory.

Thanks for the help


Right, to use RDMA with GPU memory, you only need to register the GPU memory with ibv_reg_mr.

The tools in the perftest package are good examples: run them with the --use_cuda flag and read their source to see how it is implemented.

Google “perftest cuda”



Here is a full test:


rm -rf /etc/perftest
cd /etc
git clone
cd perftest
./configure CUDA_PATH=/hpc/local/oss/cuda10.2/cuda-toolkit/ CUDA_H_PATH=/hpc/local/oss/cuda10.2/cuda-toolkit/include/cuda.h
make install

The output, including the GPU memory allocation, should look like this:

[root@l-csi-1123s gdr]# ib_write_bw -d mlx5_0 -x 3 --tclass=96 --report_gbits --run_infinitely --disable_pcie_relaxed --CPU-freq --use_cuda=0

* Waiting for client to connect... *

initializing CUDA

Listing all CUDA devices in system:

CUDA device 0: PCIe address is 1C:00

CUDA device 1: PCIe address is 41:00

Picking device No. 0

[pid = 54188, dev = 0] device name = [Tesla V100S-PCIE-32GB]

creating CUDA Ctx

making it the current CUDA Ctx

cuMemAlloc() of a 131072 bytes GPU buffer

allocated GPU buffer address at 00007fb9dfa00000 pointer=0x7fb9dfa00000

Thanks for the prompt reply.

It's good to confirm that nothing special needs to be done in the code and that I can simply pass a GPU pointer to ibv_reg_mr.

Your suggested full test:

I went through your steps to perform the test, but the --disable_pcie_relaxed option is not recognized, and running without it produces errors. I removed most of the options and just used --use_cuda=0, which seems to work fine without any error.

The main test I am trying to pull off:

I was using NCCL to test that GDR is working. Running the NCCL tests however yielded a completion status 0x4 error. So I read that I had to disable PCIe ACS which I did and now NCCL tests run fine.
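For reference, and assuming the number NCCL reports is the raw ibv_wc_status value from the work completion, 0x4 is IBV_WC_LOC_PROT_ERR (local protection error). The sketch below is a small stand-alone lookup for the first few codes. The table is copied by hand from enum ibv_wc_status in <infiniband/verbs.h>, and wc_status_name is a hypothetical helper; in real code, ibv_wc_status_str() from libibverbs gives a description directly.

```c
#include <stddef.h>

/* Names of the first entries of enum ibv_wc_status, copied by hand from
 * <infiniband/verbs.h>. The array index is the numeric status value. */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",        /* 0x0 */
    "IBV_WC_LOC_LEN_ERR",    /* 0x1 */
    "IBV_WC_LOC_QP_OP_ERR",  /* 0x2 */
    "IBV_WC_LOC_EEC_OP_ERR", /* 0x3 */
    "IBV_WC_LOC_PROT_ERR",   /* 0x4 */
    "IBV_WC_WR_FLUSH_ERR",   /* 0x5 */
};

/* Hypothetical helper: map a numeric completion status to its enum name.
 * Codes beyond this deliberately short table get a placeholder string. */
const char *wc_status_name(unsigned int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);
    return status < n ? wc_status_names[status]
                      : "(not in this short table)";
}
```

Calling wc_status_name(0x4) returns "IBV_WC_LOC_PROT_ERR". Printing the status of every failed completion this way makes it easier to tell a protection or registration problem apart from a flushed work request.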

I am also able to run my own GDR code, with the following results per scenario:

  • RDMA Write
      • GPU → GPU (done)
      • CPU → GPU (done)
      • GPU → CPU (done)

  • RDMA Write Immediate
      • GPU → GPU (done)
      • CPU → GPU (done)
      • GPU → CPU (done)

  • RDMA Read
      • GPU → GPU (Segmentation Fault at ibv_poll_cq at receiver)
      • CPU → GPU (Segmentation Fault at ibv_poll_cq at receiver)
      • GPU → CPU (done)

  • Send/Receive
      • GPU → GPU (Segmentation Fault at ibv_poll_cq at receiver)
      • CPU → GPU (Segmentation Fault at ibv_poll_cq at receiver)
      • GPU → CPU (done)

  • Send/Receive with immediate
      • GPU → GPU (Segmentation Fault at ibv_poll_cq at receiver)
      • CPU → GPU (Segmentation Fault at ibv_poll_cq at receiver)
      • GPU → CPU (done)

'→' refers to communication between two different machines.

CPU refers to memory allocated using normal malloc and GPU refers to memory allocated using cudaMalloc.

Any idea what might make ibv_poll_cq cause a segmentation fault in the failing cases above?

I have been at this for a week now, so I really appreciate any help.


Please open a new case with NVIDIA Networking support.