RDMA Questions

Hello NVIDIA developers,

I want to test RDMA on my GPUs. Before I run any experiments, I have three questions.

  1. I have an NVIDIA A16 GPU and a NIC, both of which connect over PCIe. Can I perform RDMA between them?

  2. I currently have an open-source kernel module installed to control my GPU. But according to the manual (GPUDirect RDMA 12.3 documentation, section 1, Overview), I must modify the module to perform RDMA. Are there any examples available?

  3. If I successfully modify the kernel module and my GPU is available for RDMA, can you provide some CUDA application examples that I can use to test RDMA?

Sincerely,
irakatz

1. It depends on your system. The GPU and the HCA need to be under the same PCIe root, and you should disable the IOMMU and PCIe ACS (ACSCtl).

2. I don’t think you need to modify it, but it is better to use CUDA 11.4 or later. There is already a kernel driver for GDR, nv_peer_mem.ko.

3. GDR is simple: allocate GPU memory with cuMemAlloc, then register it as an RDMA memory region (MR) with ibv_reg_mr.

There is also a CUDA manual for GDR.
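For point 1, a rough way to check the ACS and IOMMU settings is to filter the relevant lines out of lspci output and the kernel command line. This is only a sketch: it assumes the usual `lspci -vvv` layout (device header starting at column 0, followed by an indented "ACSCtl:" line whose flags end in + or -), and the function names acs_enabled and iommu_params are made up for illustration.

```shell
#!/bin/sh

# Read `lspci -vvv` output on stdin and print every device whose ACSCtl
# line has at least one flag switched on (marked with "+").
# Output format: "<bus address> ACSCtl: <flags>"
acs_enabled() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }    # device header, e.g. "00:01.0 PCI bridge: ..."
        /ACSCtl:/ && index($0, "+") {  # ACS control line with an enabled flag
            $1 = $1                    # rebuild $0 with single spaces, no indentation
            print dev, $0
        }
    '
}

# Read a kernel command line on stdin and print any IOMMU-related parameters.
# Prints nothing (and returns non-zero) if none were passed.
iommu_params() {
    tr -s ' ' '\n' | grep -i 'iommu'
}
```

Typical use would be `sudo lspci -vvv | acs_enabled` and `iommu_params < /proc/cmdline`. An empty result from the first command suggests ACS is off on all bridges. The second only shows what was passed at boot, not whether the IOMMU is actually active.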

Thanks for your reply.

I have now reinstalled the open-source kernel module (NVIDIA/open-gpu-kernel-modules on GitHub, version 525.147.05; my CUDA version is 12.0). During the installation, I found a compiled “nvidia-peermem.ko”, but it was not installed.

However, when I manually run sudo insmod nvidia-peermem.ko, it says insmod: ERROR: could not insert module nvidia-peermem.ko: Invalid parameters

How can I solve this problem?

You can try modprobe instead of insmod; modprobe also loads any modules that the requested one depends on.

There should also be a systemd service for it (nvidia-peermem or similar); you can check with “systemctl list-units --type=service”.

Also, if you installed MOFED, there is another module, nv_peer_mem.ko, that does the same job as nvidia-peermem. You only need to use one of them.
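To see which of the two modules is actually loaded, you can look for them in /proc/modules. A small sketch, assuming they show up there as nvidia_peermem and nv_peer_mem (the kernel lists module names with underscores); the function name peermem_loaded is only illustrative.

```shell
#!/bin/sh

# Print the GPUDirect RDMA peer-memory modules found in a module listing.
# Takes an optional file in /proc/modules format; defaults to /proc/modules.
peermem_loaded() {
    awk '$1 == "nvidia_peermem" || $1 == "nv_peer_mem" { print $1 }' \
        "${1:-/proc/modules}"
}
```

Running `peermem_loaded` with no arguments checks the live system. No output means neither module is loaded.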
