The goal is simple: I want to use GPUDirect RDMA at the ibverbs level. I do not want to use a CUDA-aware MPI implementation because I need a greater level of control. I want to perform transfers from a GPU to a remote host using GPUDirect RDMA.
After rummaging through a couple of scattered posts, my understanding is that I only need to install the nv_peer_mem module. ibverbs can then differentiate between a GPU pointer and a main-memory pointer automatically and make the necessary changes. So I can basically cudaMalloc(&gpu_ptr, size), register the GPU pointer using ibv_reg_mr, and continue normally through my application.
Is this understanding correct?
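For reference, this is the kind of minimal sequence I have in mind (a sketch only, assuming the nv_peer_mem module is loaded; error handling is trimmed and the device/port selection is illustrative, not from a working setup):

```c
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t size = 1 << 20;          /* 1 MiB GPU buffer */
    void *gpu_ptr = NULL;

    /* Allocate device memory exactly as for any CUDA workload. */
    if (cudaMalloc(&gpu_ptr, size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* With nv_peer_mem loaded, ibv_reg_mr should accept the GPU
     * pointer just like a host pointer (this is the assumption
     * I am asking about). */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_ptr, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr on GPU memory");
        return 1;
    }
    printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n",
           mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_ptr);
    return 0;
}
```

The lkey/rkey from that MR would then go into the usual sge and work-request structures, the same as for host memory.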
My setup, if that helps:
I am running Ubuntu 18.04.4 with a Mellanox ConnectX-5 NIC and a V100 GPU. I have OFED 4.6, GPU driver version 455.32, and CUDA 11.1.
I can run CUDA kernels without issue, and I can run RDMA over ibverbs without issue. My main goal is to perform RDMA send and receive operations using GPU memory.
It's good to confirm that nothing special needs to be done in the code and that I can simply pass a GPU pointer to ibv_reg_mr.
Your suggested full test:
I went through your steps to perform the test, but the --disable_pcie_relaxed option is not recognized, and running without it produces errors. I removed most of the options and just used --use_cuda=0, which seems to work fine without any error.
My main test that I am trying to pull off:
I was using NCCL to test that GDR is working. Running the NCCL tests, however, yielded a completion status 0x4 error (status 4 corresponds to IBV_WC_LOC_PROT_ERR, a local protection error). I read that I had to disable PCIe ACS, which I did, and now the NCCL tests run fine.
I am also able to run my own GDR code for the following scenarios:
RDMA Write
GPU → GPU (done)
CPU → GPU (done)
GPU → CPU (done)
RDMA Write Immediate
GPU → GPU (done)
CPU → GPU (done)
GPU → CPU (done)
RDMA Read
GPU → GPU (Segmentation fault at ibv_poll_cq at receiver)
CPU → GPU (Segmentation fault at ibv_poll_cq at receiver)
GPU → CPU (done)
Send/Receive
GPU → GPU (Segmentation fault at ibv_poll_cq at receiver)
CPU → GPU (Segmentation fault at ibv_poll_cq at receiver)
GPU → CPU (done)
Send/Receive with immediate
GPU → GPU (Segmentation fault at ibv_poll_cq at receiver)
CPU → GPU (Segmentation fault at ibv_poll_cq at receiver)
GPU → CPU (done)
The arrow (→) refers to communication across two different machines.
CPU refers to memory allocated using ordinary malloc, and GPU refers to memory allocated using cudaMalloc.
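For what it's worth, the receiver-side completion loop I would expect to be safe looks roughly like the sketch below. Two common ways ibv_poll_cq itself segfaults are polling a CQ pointer that is NULL because ibv_create_cq failed silently, and using the contents of a struct ibv_wc that was never filled in because the poll return value wasn't checked (the function name and structure here are my own illustration, not from my actual code):

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Busy-poll one completion, checking every return value before
 * using it. cq is assumed to come from a successful ibv_create_cq(). */
static int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    if (!cq) {                        /* a NULL CQ is an easy segfault */
        fprintf(stderr, "CQ was never created\n");
        return -1;
    }

    do {
        n = ibv_poll_cq(cq, 1, &wc);  /* <0 on error, 0 if CQ is empty */
    } while (n == 0);

    if (n < 0) {
        fprintf(stderr, "ibv_poll_cq failed\n");
        return -1;
    }
    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "completion error: %s (wr_id %llu)\n",
                ibv_wc_status_str(wc.status),
                (unsigned long long)wc.wr_id);
        return -1;
    }
    return 0;
}
```

If the loop itself is fine, the other usual suspects in exactly these scenarios (the ones that deliver into GPU memory) are a receive work request whose sge.lkey comes from the host MR instead of the GPU MR, or a CQ/QP that is only created or initialized on the code path the GPU-receive cases skip.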
Any idea what might cause ibv_poll_cq to segfault in the situations above?
I have been at this for a week now, so I really appreciate any help.