GPU Direct RDMA Help

So I have been trying to enable and use GPUDirect RDMA for the last few days, but I'm pretty lost.
I have looked at https://docs.nvidia.com/cuda/gpudirect-rdma/index.html but there is really no section that explains how to install or enable GPU Direct RDMA support.

The only relevant text I found is the following sentence: “To add GPUDirect RDMA support to a device driver, a small amount of address mapping code within the kernel driver must be modified. This code typically resides near existing calls to get_user_pages().”

Not sure if I'm missing something in the documentation, but I am still very confused about how to install/enable GPUDirect RDMA.

I looked at https://www.mellanox.com/products/GPUDirect-RDMA which only lists the system requirements and offers a link to download nvidia-peer-memory_1.1.tar.gz. Should I just download and install this package? The page does not provide any instructions.
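For what it's worth, the steps I've pieced together so far for that tarball look roughly like this on Ubuntu (this is my best guess from the package's README, not official instructions; exact paths and version numbers may differ):

```shell
# Sketch: building/installing the nv_peer_mem kernel module on Ubuntu.
# Assumes Mellanox OFED and the NVIDIA driver are already installed.
tar xzf nvidia-peer-memory_1.1.tar.gz
cd nvidia-peer-memory-1.1
./build_module.sh            # prepares build artifacts under /tmp
cd /tmp
tar xzf nvidia-peer-memory_1.1.orig.tar.gz
cd nvidia-peer-memory-1.1
dpkg-buildpackage -us -uc    # builds the .deb packages
sudo dpkg -i ../nvidia-peer-memory_1.1-0_all.deb
sudo dpkg -i ../nvidia-peer-memory-dkms_1.1-0_all.deb

# Verify the module is actually loaded:
lsmod | grep nv_peer_mem
```

Is that the right idea, or am I off track?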

Furthermore, let's assume that I did enable GPUDirect. How do I go about using it? Can I just cudaMalloc a buffer, register the memory region in ibverbs, and proceed as normal?

I’m not sure if I’m missing some basic knowledge and if what I need to do is obvious to most people or not. It certainly isn’t for me.

I would appreciate any explanations/tips/links/tutorials.

Appreciate your time.

My setup if that helps:
I am running an Ubuntu 18.04.4 machine with a Mellanox ConnectX-5 NIC and a V100 GPU. I have OFED 4.6, GPU driver version 455.32, and CUDA version 11.1.
I am able to run CUDA kernels without issue. And I am able to run RDMA using ibverbs without issue. My main goal is to run RDMA to perform send and receive operations using GPU memory.

GPUDirect RDMA is primarily used to transfer data directly from the memory of a GPU in machine A to the memory of a GPU (or possibly some other device) in machine B.

If you only have 1 GPU, or only 1 machine, GPUDirect RDMA may be irrelevant.

The typical way to use GPUDirect RDMA in a multi-machine setup is to:

  1. Install Mellanox OFED
  2. Build/install a communication library such as NCCL or MPI (for MPI, build a CUDA-aware MPI)
  3. Profit!
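As a concrete example of step 2, building Open MPI from source with CUDA support looks roughly like this (version number and install prefix are just examples, adjust to your system):

```shell
# Sketch: building a CUDA-aware Open MPI from source.
# Assumes the CUDA toolkit is installed at /usr/local/cuda.
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.5.tar.gz
tar xzf openmpi-4.0.5.tar.gz
cd openmpi-4.0.5
./configure --with-cuda=/usr/local/cuda --prefix="$HOME/openmpi-cuda"
make -j"$(nproc)" && make install

# Confirm CUDA support was compiled in:
"$HOME/openmpi-cuda/bin/ompi_info" --parsable --all | grep mpi_built_with_cuda_support:value
```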

In the general case, you can use GPUDirect RDMA to transfer data directly from a non-GPU device (such as an FPGA, or a networking adapter) to GPU memory. This requires device driver development, and the device driver development instructions begin with the link you indicated.

Thanks for the quick response!

Yeah of course I have multiple machines, they have the same setup as described above.
And I’m actually looking to work on a lower level than MPI or NCCL since our research group is working on its own collective communication library so to speak.

I was under the impression that I had to install a special plugin (nv_peer_mem) or do some manual code edits for GPUDirect RDMA to work. Is that not the case? Can I just do cudaMalloc and register the memory region in ibverbs and work as I normally would with ibverbs?
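To be concrete, this is roughly the flow I have in mind (just a fragment, not a complete program: it assumes an already-opened verbs context and protection domain, assumes the nv_peer_mem module is loaded, and omits all error handling):

```c
/* Sketch: registering a cudaMalloc'd buffer with ibverbs.
 * Requires GPUDirect RDMA support (e.g. the nv_peer_mem kernel
 * module) to be in place; error handling omitted for brevity. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);  /* allocate device (GPU) memory */

    /* If GPUDirect RDMA works the way I hope, the device pointer
     * can be registered like any ordinary host buffer. */
    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```

After that, the resulting `struct ibv_mr` would be used in send/receive work requests as usual. Is that the right mental model?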

NCCL is open-source. So are various MPI installations. If you follow the steps necessary to enable either NCCL or MPI, you should be able to write another communication library on that foundation.

So if I were going down this path, the first thing I would do is get CUDA-aware MPI or NCCL up and running. There are instructions for that in various places on the web.

Then, it should be possible to learn how to create a communication library by studying either of those examples.

I see. I was hoping for a more straightforward approach but I guess that is unavailable.
Thanks for the help.