So I have been trying to enable and use GPUDirect RDMA for the last few days, but I'm pretty lost.
I have looked at https://docs.nvidia.com/cuda/gpudirect-rdma/index.html, but there is no section that explains how to install or enable GPUDirect RDMA support, only the following sentence: "To add GPUDirect RDMA support to a device driver, a small amount of address mapping code within the kernel driver must be modified. This code typically resides near existing calls to get_user_pages()."
Not sure if I'm missing something in the documentation, but I am still very confused as to how to install/enable GPUDirect RDMA.
I also looked at https://www.mellanox.com/products/GPUDirect-RDMA, which only lists the system requirements and offers a link to download nvidia-peer-memory_1.1.tar.gz. Should I just download and install that? The page does not provide any instructions.
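For what it's worth, my current guess (inferred from the tarball name and the usual Debian packaging workflow, so please correct me if this is wrong) is that the install on Ubuntu would look something like this:

```shell
# Assumed steps -- not verified, package/module names are my guess:
tar xzf nvidia-peer-memory_1.1.tar.gz
cd nvidia-peer-memory-1.1
dpkg-buildpackage -us -uc                       # build the .deb
sudo dpkg -i ../nvidia-peer-memory_1.1-0_all.deb
sudo modprobe nv_peer_mem                       # load the kernel module
lsmod | grep nv_peer_mem                        # check that it is loaded
```

Is loading the nv_peer_mem module really all that "enabling GPUDirect RDMA" amounts to, given that OFED and the NVIDIA driver are already installed?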
Furthermore, let's assume that I did enable GPUDirect. How do I go about using it? Can I just cudaMalloc a buffer, register that memory region with ibverbs, and proceed as normal?
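Concretely, what I had in mind is the following, which is just a sketch of my understanding and not verified to work (it assumes pd is an ibv_pd* created the usual way, as in a normal host-memory RDMA program):

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Sketch: register a GPU buffer for RDMA. My assumption is that, with
 * GPUDirect RDMA enabled, ibv_reg_mr() accepts the device pointer from
 * cudaMalloc() directly -- is that correct? */
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);          /* device memory, not host memory */

    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```

And then I would use mr->lkey / mr->rkey in my scatter-gather entries and work requests exactly as I do today with host memory. Is that the right mental model, or is there more to it?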
I’m not sure if I’m missing some basic knowledge and if what I need to do is obvious to most people or not. It certainly isn’t for me.
I would appreciate any explanations/tips/links/tutorials.
Appreciate your time.
My setup, if that helps:
I am running an Ubuntu 18.04.4 machine with a Mellanox ConnectX-5 NIC and a V100 GPU, with OFED 4.6, GPU driver version 455.32, and CUDA version 11.1.
I am able to run CUDA kernels without issue, and I am able to run RDMA using ibverbs without issue. My main goal is to perform RDMA send and receive operations using GPU memory.