GPU Direct RDMA Help

So I have been trying to enable and use GPUDirect RDMA for the last few days, but I'm pretty lost.
I have looked at https://docs.nvidia.com/cuda/gpudirect-rdma/index.html but there is really no section that explains how to install or enable GPU Direct RDMA support.

The only relevant text I found is the following sentence: “To add GPUDirect RDMA support to a device driver, a small amount of address mapping code within the kernel driver must be modified. This code typically resides near existing calls to get_user_pages().”

Not sure if I'm missing something in the documentation, but I am still very confused about how to install/enable GPUDirect RDMA.

I looked at https://www.mellanox.com/products/GPUDirect-RDMA which only lists the system requirements and offers a link to download nvidia-peer-memory_1.1.tar.gz. Should I just download and install this package? The page does not provide any instructions.
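For what it's worth, the steps I've pieced together so far for that tarball look roughly like this on Ubuntu (this is my best guess from the package's README, not official instructions; exact paths and version numbers may differ):

```shell
# Sketch: building/installing the nv_peer_mem kernel module on Ubuntu.
# Assumes Mellanox OFED and the NVIDIA driver are already installed.
tar xzf nvidia-peer-memory_1.1.tar.gz
cd nvidia-peer-memory-1.1
./build_module.sh            # prepares build artifacts under /tmp
cd /tmp
tar xzf nvidia-peer-memory_1.1.orig.tar.gz
cd nvidia-peer-memory-1.1
dpkg-buildpackage -us -uc    # builds the .deb packages
sudo dpkg -i ../nvidia-peer-memory_1.1-0_all.deb
sudo dpkg -i ../nvidia-peer-memory-dkms_1.1-0_all.deb

# Verify the module is actually loaded:
lsmod | grep nv_peer_mem
```

Is that the right idea, or am I off track?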

Furthermore, let's assume that I did enable GPUDirect. How do I go about using it? Can I just cudaMalloc a buffer, register the memory region in ibverbs, and proceed as normal?

I’m not sure if I’m missing some basic knowledge and if what I need to do is obvious to most people or not. It certainly isn’t for me.

I would appreciate any explanations/tips/links/tutorials.

Appreciate your time.

My setup if that helps:
I am running an Ubuntu 18.04.4 machine with a Mellanox ConnectX-5 NIC and a V100 GPU. I have OFED 4.6, GPU driver version 455.32, and CUDA version 11.1.
I am able to run CUDA kernels without issue. And I am able to run RDMA using ibverbs without issue. My main goal is to run RDMA to perform send and receive operations using GPU memory.

GPUDirect RDMA is primarily used to transfer data directly from the memory of a GPU in machine A to the memory of a GPU (or possibly some other device) in machine B.

If you only have 1 GPU, or only 1 machine, GPUDirect RDMA may be irrelevant.

The typical way to use GPUDirect RDMA in a multi-machine setup is to:

  1. Install Mellanox OFED
  2. Build/install a communication library such as NCCL or MPI (for MPI, build a CUDA-aware MPI)
  3. Profit!
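As a concrete example of step 2, building Open MPI from source with CUDA support looks roughly like this (version number and install prefix are just examples, adjust to your system):

```shell
# Sketch: building a CUDA-aware Open MPI from source.
# Assumes the CUDA toolkit is installed at /usr/local/cuda.
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.5.tar.gz
tar xzf openmpi-4.0.5.tar.gz
cd openmpi-4.0.5
./configure --with-cuda=/usr/local/cuda --prefix="$HOME/openmpi-cuda"
make -j"$(nproc)" && make install

# Confirm CUDA support was compiled in:
"$HOME/openmpi-cuda/bin/ompi_info" --parsable --all | grep mpi_built_with_cuda_support:value
```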

In the general case, you can use GPUDirect RDMA to transfer data directly from a non-GPU device (such as an FPGA, or a networking adapter) to GPU memory. This requires device driver development, and the device driver development instructions begin with the link you indicated.

Thanks for the quick response!

Yeah of course I have multiple machines, they have the same setup as described above.
And I’m actually looking to work on a lower level than MPI or NCCL since our research group is working on its own collective communication library so to speak.

I was under the impression that I had to install a special plugin (nv_peer_mem) or do some manual code edits for GPUDirect RDMA to work. Is that not the case? Can I just do cudaMalloc and register the memory region in ibverbs and work as I normally would with ibverbs?
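To be concrete, this is roughly the flow I have in mind (just a fragment, not a complete program: it assumes an already-opened verbs context and protection domain, assumes the nv_peer_mem module is loaded, and omits all error handling):

```c
/* Sketch: registering a cudaMalloc'd buffer with ibverbs.
 * Requires GPUDirect RDMA support (e.g. the nv_peer_mem kernel
 * module) to be in place; error handling omitted for brevity. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);  /* allocate device (GPU) memory */

    /* If GPUDirect RDMA works the way I hope, the device pointer
     * can be registered like any ordinary host buffer. */
    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```

After that, the resulting `struct ibv_mr` would be used in send/receive work requests as usual. Is that the right mental model?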

NCCL is open-source. So are various MPI installations. If you follow the steps necessary to enable either NCCL or MPI, you should be able to write another communication library on that foundation.

So if I were going down this path, the first thing I would do is get CUDA-aware MPI or NCCL up and running. There are instructions for that in various places on the web.

Then, it should be possible to learn how to create a communication library by studying either of those examples.

I see. I was hoping for a more straightforward approach but I guess that is unavailable.
Thanks for the help.