Originally published at: https://developer.nvidia.com/blog/unlocking-gpu-accelerated-rdma-with-nvidia-doca-gpunetio/
NVIDIA DOCA GPUNetIO is a library within the DOCA SDK, specifically designed for real-time inline GPU packet processing. It combines technologies like GPUDirect RDMA and GPUDirect Async to enable the creation of GPU-centric applications where a CUDA kernel can directly communicate with the network interface card (NIC) for sending and receiving packets, bypassing the CPU…
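To make the "GPU-centric" idea concrete, below is a minimal sketch of a CUDA kernel that stages an RDMA write and then rings the NIC doorbell itself, with no CPU on the data path. It is modeled on the DOCA GPUNetIO device API (doca_gpu_dev_rdma_write_strong / doca_gpu_dev_rdma_commit_strong), but header names, argument lists, and flag values differ between DOCA releases, so treat everything here as an approximation to check against your installed headers rather than a drop-in example.

```cpp
// Illustrative sketch only: header names, argument lists, and flag values are
// approximations of the DOCA GPUNetIO device API and must be verified against
// the headers shipped with your DOCA release.
#include <doca_gpunetio_dev_buf.cuh>
#include <doca_gpunetio_dev_rdma.cuh>

__global__ void rdma_write_kernel(struct doca_gpu_dev_rdma *rdma_gpu,
                                  struct doca_gpu_buf_arr *local_buf_arr,
                                  struct doca_gpu_buf_arr *remote_buf_arr,
                                  size_t msg_size)
{
    struct doca_gpu_buf *local_buf;
    struct doca_gpu_buf *remote_buf;

    /* A single thread stages and posts the RDMA write for this queue. */
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        doca_gpu_dev_buf_get_buf(local_buf_arr, 0, &local_buf);
        doca_gpu_dev_buf_get_buf(remote_buf_arr, 0, &remote_buf);

        /* Stage the write descriptor on the GPU-managed RDMA queue. */
        doca_gpu_dev_rdma_write_strong(rdma_gpu,
                                       remote_buf, 0 /* remote offset */,
                                       local_buf, 0 /* local offset */,
                                       msg_size, 0 /* immediate value */,
                                       DOCA_GPU_RDMA_WRITE_FLAG_NONE);

        /* Ring the NIC doorbell directly from the CUDA kernel. */
        doca_gpu_dev_rdma_commit_strong(rdma_gpu);
    }
}
```

The usual host-side DOCA setup (creating the RDMA context, exporting queues and buffer arrays to the GPU, launching the kernel) still applies; the point of the pattern is that once the kernel is running, the CPU is out of the send/receive path.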
For Figure 3, why is the latency so high? 64 bytes takes 100 us and 4096 bytes nearly 600 us. With a CX-5, perftest takes only 10 us with RDMA.
The two Dell R750 machines I used for the benchmarks don’t have the best PCIe topology for applications using GPUDirect: the H100 and the ConnectX-7 are connected to two different PCIe slots on different NUMA nodes.
I will provide more benchmarks in the future on other, more GPUDirect-friendly platforms.
Please consider that it’s out of the scope of this blog post to show the best performance perftest can achieve.
The goal is to show that DOCA GPUNetIO RDMA performance is in line with the well-known perftest CPU RDMA code even in the case of an “inconvenient” system topology; a typical perftest latency invocation is sketched below for reference.
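For anyone who wants to gather a comparable CPU baseline, a perftest latency run looks like the commands below. The device name, message size, and iteration count are illustrative placeholders, not the exact settings used for the figures in the post.

```
# Server side (mlx5_0 is a placeholder for your ConnectX device)
ib_write_lat -d mlx5_0 -F -s 64 -n 10000

# Client side, pointing at the server
ib_write_lat -d mlx5_0 -F -s 64 -n 10000 <server_address>
```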
NVIDIA DOCA GPUNetIO, a library within the NVIDIA DOCA SDK, empowers real-time inline GPU packet processing. By combining technologies like GPUDirect RDMA and GPUDirect Async, it allows direct communication between a GPU CUDA kernel and the network interface card (NIC), bypassing the CPU. Now, with DOCA 2.7, it even supports RDMA communications directly from the GPU using RoCE or InfiniBand transport layers.
@jwitsoe
As shown in the figure, only one CUDA kernel can call rdma_commit at the same time. If the CUDA kernel launches multiple blocks, can thread 0 in each block call this interface in parallel? Thank you!