Originally published at: https://developer.nvidia.com/blog/unlocking-gpu-accelerated-rdma-with-nvidia-doca-gpunetio/
NVIDIA DOCA GPUNetIO is a library within the DOCA SDK, specifically designed for real-time inline GPU packet processing. It combines technologies like GPUDirect RDMA and GPUDirect Async to enable the creation of GPU-centric applications where a CUDA kernel can directly communicate with the network interface card (NIC) for sending and receiving packets, bypassing the CPU…
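To make the "GPU-centric" idea concrete, below is a minimal sketch of a CUDA kernel that stages an RDMA write and then rings the NIC doorbell itself, with no CPU on the data path. It is modeled on the DOCA GPUNetIO device API (doca_gpu_dev_rdma_write_strong / doca_gpu_dev_rdma_commit_strong), but header names, argument lists, and flag values differ between DOCA releases, so treat everything here as an approximation to check against your installed headers rather than a drop-in example.

```cpp
// Illustrative sketch only: header names, argument lists, and flag values are
// approximations of the DOCA GPUNetIO device API and must be verified against
// the headers shipped with your DOCA release.
#include <doca_gpunetio_dev_buf.cuh>
#include <doca_gpunetio_dev_rdma.cuh>

__global__ void rdma_write_kernel(struct doca_gpu_dev_rdma *rdma_gpu,
                                  struct doca_gpu_buf_arr *local_buf_arr,
                                  struct doca_gpu_buf_arr *remote_buf_arr,
                                  size_t msg_size)
{
    struct doca_gpu_buf *local_buf;
    struct doca_gpu_buf *remote_buf;

    /* A single thread stages and posts the RDMA write for this queue. */
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        doca_gpu_dev_buf_get_buf(local_buf_arr, 0, &local_buf);
        doca_gpu_dev_buf_get_buf(remote_buf_arr, 0, &remote_buf);

        /* Stage the write descriptor on the GPU-managed RDMA queue. */
        doca_gpu_dev_rdma_write_strong(rdma_gpu,
                                       remote_buf, 0 /* remote offset */,
                                       local_buf, 0 /* local offset */,
                                       msg_size, 0 /* immediate value */,
                                       DOCA_GPU_RDMA_WRITE_FLAG_NONE);

        /* Ring the NIC doorbell directly from the CUDA kernel. */
        doca_gpu_dev_rdma_commit_strong(rdma_gpu);
    }
}
```

The usual host-side DOCA setup (creating the RDMA context, exporting queues and buffer arrays to the GPU, launching the kernel) still applies; the point of the pattern is that once the kernel is running, the CPU is out of the send/receive path.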
For Figure 3, why is the latency so high? 64 bytes takes 100 us and 4096 bytes nearly 600 us. With a CX-5, perftest takes only 10 us with RDMA.
The two Dell R750 machines I used for the benchmarks don’t have the best PCIe topology for applications using GPUDirect: the H100 and the ConnectX-7 are connected to two different PCIe slots on different NUMA nodes.
I will provide more benchmarks in the future on other, more GPUDirect-friendly platforms.
Please consider that it’s out of the scope of this blog post to show the best performance perftest can achieve.
The goal is to show that DOCA GPUNetIO RDMA performance is in line with the well-known perftest CPU RDMA code even in the case of an “inconvenient” system topology; a typical perftest latency invocation is sketched below for reference.
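For anyone who wants to gather a comparable CPU baseline, a perftest latency run looks like the commands below. The device name, message size, and iteration count are illustrative placeholders, not the exact settings used for the figures in the post.

```
# Server side (mlx5_0 is a placeholder for your ConnectX device)
ib_write_lat -d mlx5_0 -F -s 64 -n 10000

# Client side, pointing at the server
ib_write_lat -d mlx5_0 -F -s 64 -n 10000 <server_address>
```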
NVIDIA DOCA GPUNetIO, a library within the NVIDIA DOCA SDK, empowers real-time inline GPU packet processing. By combining technologies like GPUDirect RDMA and GPUDirect Async, it allows direct communication between a GPU CUDA kernel and the network interface card (NIC), bypassing the CPU. Now, with DOCA 2.7, it even supports RDMA communications directly from the GPU using RoCE or InfiniBand transport layers.
@jwitsoe
As shown in the figure, only one CUDA kernel can call rdma_commit at the same time. If the CUDA kernel launches multiple blocks, can thread 0 in each block call this interface in parallel? Thank you!