We are looking to use RDMA over InfiniBand for high-speed, low-latency data transfer between separate compute nodes for real-time analytics during manufacturing. We will be using zero-copy RDMA mostly for moving data from the sensors to the analysis nodes and then back to the manufacturing devices. The analysis runtime will be in C#, with CUDA functions loaded via C++ shared libraries (a rough sketch of that interop boundary follows the hardware list). We will be running the following NVIDIA InfiniBand hardware on Ubuntu 22.04 LTS:
- NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition - 96GB GDDR7
- NVIDIA MCX75310AAS-NEAT ConnectX-7 adapter card
- MMA4Z00-NS400 Compatible 400GBASE-SR4 OSFP Flat Top PAM4 850nm 50m DOM MPO-12/APC MMF InfiniBand NDR Optical Transceiver Module
- MFP7E10-N030 Compatible 30m (98ft) MTP®-12 APC (Female) to MTP®-12 APC (Female), 8 Fibers, Multimode, Magenta
- MQM9700-NS2F - Managed - NVIDIA Quantum-2 QM9700 400G InfiniBand Switch
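For context on the C#/CUDA split mentioned above, the CUDA side would be exposed to the C# runtime through plain extern "C" entry points in a shared library, roughly like this. The function name, kernel, and buffer layout are placeholders of ours, not an existing API:

```cpp
// analyze.cu - minimal sketch of the C++/CUDA shared-library boundary
#include <cstddef>
#include <cuda_runtime.h>

// Trivial stand-in for the real analytics kernel.
__global__ void scale_kernel(const float* in, float* out, std::size_t n)
{
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

extern "C" {

// Exported with C linkage so the C# runtime can bind it via [DllImport].
// Takes device pointers that the RDMA/NCCL layer has already filled.
int analyze_frame(const float* d_in, float* d_out, std::size_t n)
{
    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    scale_kernel<<<blocks, threads>>>(d_in, d_out, n);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : -1;
}

}  // extern "C"
```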
While looking through the RDMA documentation, I found multiple different libraries, so I am curious what the most current and straightforward way to handle RDMA programming would be. Specific questions:
- Do the RTX PRO 6000 and Ubuntu 24.04 LTS support DMA-BUF for GPUDirect RDMA, as opposed to the legacy nvidia-peermem module? (The first sketch below shows the registration flow we have in mind.)
- Which kernel modules and packages need to be installed for DMA-BUF-based RDMA with NCCL on Ubuntu?
- How much latency would be added by a single library routine that allocates, locks (pins), transfers, and frees memory as one unit, invoked as a separate call for each transfer? (The second sketch below shows what we mean.)
- Can GDRCopy be used as a drop-in replacement for cudaMalloc in existing CUDA C++ code to reduce the latency of moving data from GPU to CPU and back? (The third sketch below shows how we currently understand its usage.)
- What latency should we expect when using GDRCopy to move data from the GPU to the CPU, do some processing on the CPU, and send the result back to the GPU for further processing and transfer via NCCL over RDMA?
- Do these modules work with a real-time (PREEMPT_RT) Linux kernel?
- Will we continue to receive NVIDIA driver updates if we stay on an Ubuntu release under extended long-term support (ESM)?
- If we have two GPUs in the analysis system (both the exact model listed above), would it be better to reserve one GPU exclusively for this pipeline to reduce latency?
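To make the DMA-BUF question concrete, this is roughly the registration path we have in mind: ask the CUDA driver for a dma-buf file descriptor covering a device allocation with cuMemGetHandleForAddressRange, then register that fd with ibv_reg_dmabuf_mr instead of going through nvidia-peermem. This is only a sketch of our understanding; PD/QP setup, alignment checks, and most error handling are omitted, and the helper name is our own placeholder:

```cpp
#include <cstdio>
#include <cuda.h>
#include <infiniband/verbs.h>

// Sketch: register a cuMemAlloc'd/cudaMalloc'd region with the HCA via
// DMA-BUF instead of nvidia-peermem. Requires a driver/kernel combination
// with DMA-BUF support; d_ptr and bytes must be suitably aligned.
ibv_mr* register_gpu_buffer_dmabuf(ibv_pd* pd, CUdeviceptr d_ptr, size_t bytes)
{
    // Export a dma-buf fd for the device address range.
    int dmabuf_fd = -1;
    CUresult rc = cuMemGetHandleForAddressRange(
        &dmabuf_fd, d_ptr, bytes, CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0);
    if (rc != CUDA_SUCCESS) {
        std::fprintf(stderr, "cuMemGetHandleForAddressRange failed: %d\n",
                     static_cast<int>(rc));
        return nullptr;
    }

    // Hand the fd to rdma-core; the HCA can then DMA directly to/from
    // GPU memory for RDMA reads and writes.
    return ibv_reg_dmabuf_mr(pd, /*offset=*/0, bytes, /*iova=*/d_ptr,
                             dmabuf_fd,
                             IBV_ACCESS_LOCAL_WRITE |
                             IBV_ACCESS_REMOTE_WRITE |
                             IBV_ACCESS_REMOTE_READ);
}
```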
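For the all-in-one library question, this is the kind of routine we mean: it allocates, registers (pins), stages, and frees on every call, so we assume the memory-registration cost lands on the critical path of each transfer rather than being paid once up front. The helper name is hypothetical and the actual send/completion plumbing is elided:

```cpp
#include <cstdlib>
#include <cstring>
#include <infiniband/verbs.h>

// Hypothetical all-in-one helper: allocate, pin/register, stage the payload,
// then tear everything down again. Called once per transfer, so ibv_reg_mr /
// ibv_dereg_mr (plus malloc/free) run every time.
bool transfer_once(ibv_pd* pd, const void* payload, size_t bytes)
{
    void* buf = std::malloc(bytes);                        // allocate
    if (!buf) return false;

    ibv_mr* mr = ibv_reg_mr(pd, buf, bytes,                // lock (pin) + register
                            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (!mr) { std::free(buf); return false; }

    std::memcpy(buf, payload, bytes);                      // stage the data
    // ... build ibv_sge / ibv_send_wr, ibv_post_send(), poll the CQ ...

    ibv_dereg_mr(mr);                                      // unpin
    std::free(buf);                                        // free
    return true;
}
```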
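On the GDRCopy questions, our current understanding is that GDRCopy is not an allocator and therefore not a literal drop-in for cudaMalloc: the buffer is still allocated with cudaMalloc/cuMemAlloc, and GDRCopy then pins it and maps it into CPU address space so small, latency-critical copies can bypass cudaMemcpy. A sketch of that usage as we understand it, with GPU-page alignment and most error handling omitted and a placeholder function name:

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>
#include <gdrapi.h>

// Sketch: GDRCopy sits on top of an ordinary device allocation; it does not
// replace cudaMalloc. It gives the CPU a BAR1 mapping of the GPU buffer so
// small, latency-sensitive copies avoid the cudaMemcpy path.
int gdrcopy_roundtrip(const float* host_in, float* host_out, size_t bytes)
{
    void* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return -1;   // normal CUDA alloc

    gdr_t g = gdr_open();                                      // open /dev/gdrdrv
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_buf, bytes, 0, 0, &mh);

    void* bar_ptr = nullptr;
    gdr_map(g, mh, &bar_ptr, bytes);                           // CPU-visible mapping

    gdr_copy_to_mapping(mh, bar_ptr, host_in, bytes);          // CPU -> GPU (low latency)
    // ... launch kernels / NCCL transfer on d_buf here ...
    gdr_copy_from_mapping(mh, host_out, bar_ptr, bytes);       // GPU -> CPU

    gdr_unmap(g, mh, bar_ptr, bytes);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_buf);
    return 0;
}
```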