Optimizing Inline Packet Processing Using DPDK and GPUdev with GPUs

Originally published at: https://developer.nvidia.com/blog/optimizing-inline-packet-processing-using-dpdk-and-gpudev-with-gpus/

Inline processing of network packets using GPUs is a packet analysis technique useful to a number of different applications.

This was a really interesting article. Looking at the DPDK links, I noticed that the only cards the driver supports are V100/A100 Tesla-class GPUs. Is there any intention to change this in the future to include lower-spec Quadro cards? Also, is this technique specific to InfiniBand, or can it be used for vanilla Ethernet too?

I recently extended the support to more GPUs (dpdk/devices.h at main · DPDK/dpdk · GitHub). If your Tesla or Quadro GPU is not there, please let me know and I will add it.
You can use whatever card supports GPUDirect RDMA to receive packets in GPU memory, but so far this solution has been tested with ConnectX cards only.
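
For reference, adding a board to DPDK's gpu/cuda driver essentially means adding its PCI device ID to the driver's PCI match table. Below is a minimal sketch of what such an entry could look like; the macro names, the header, and the 0x24b0 RTX A4000 device ID are my assumptions, so check drivers/gpu/cuda/devices.h and cuda.c in your DPDK tree and verify the ID with lspci -nn.

/* Hypothetical sketch of a new entry in the cuda driver's PCI match table. */
#include <rte_bus_pci.h>    /* struct rte_pci_id, RTE_PCI_DEVICE() */

#define NVIDIA_GPU_VENDOR_ID        (0x10de)  /* NVIDIA PCI vendor ID */
#define NVIDIA_GPU_A4000_DEVICE_ID  (0x24b0)  /* RTX A4000; assumed, verify with lspci -nn */

static const struct rte_pci_id cuda_pci_ids[] = {
	/* ... existing Tesla/Quadro entries ... */
	{
		RTE_PCI_DEVICE(NVIDIA_GPU_VENDOR_ID, NVIDIA_GPU_A4000_DEVICE_ID)
	},
	{
		.vendor_id = 0 /* sentinel marking the end of the table */
	}
};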

Great - my card (A4000) is there now - thanks! I'm intending to test this with a ConnectX-5 NIC to process Ethernet packets. One more question: is a multi-GPU setup required for this? I am assuming it is a requirement for persistent kernels, but how about for method 4? I would like to try a dev setup with the A4000 card acting as both my display device and a packet processor.

Thank you for an interesting article. How is the data consistency issue (as described in [1]) resolved for the persistent kernel (method 3)? As I understand it, GDRCopy translates to RDMA operations, but AFAIK ordering between RDMA ops isn't ensured from the perspective of a concurrently running GPU kernel.

[1] GPUDirect RDMA :: CUDA Toolkit Documentation

No, you don't need a multi-GPU setup with any of the methods described in the post. As an example, in the case of the persistent kernel you need to tune the number of CUDA blocks (that is, the persistent kernel occupancy) so that it does not occupy the entire GPU and SMs remain available for other processing kernels.
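
As a minimal sketch of that tuning (the kernel, flag, and block/thread counts below are hypothetical, not the l2fwd-nv code), the idea is to launch the persistent kernel on only a fraction of the SMs so the rest of the GPU stays free for other kernels:

#include <cuda_runtime.h>
#include <stdint.h>

__global__ void persistent_packet_kernel(volatile uint32_t *quit_flag)
{
	/* Poll until the host signals shutdown; a real kernel would also poll a
	 * per-burst "ready" flag and process the packets of each burst here. */
	while (*quit_flag == 0) {
		/* ... wait for the next burst and process it ... */
	}
}

int launch_persistent_kernel(cudaStream_t stream, uint32_t *quit_flag)
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	/* Occupy only half of the SMs (one block per SM here), leaving the
	 * remaining SMs available for other processing kernels. */
	int nblocks  = prop.multiProcessorCount / 2;
	int nthreads = 512; /* threads per block, tuned to the workload */

	persistent_packet_kernel<<<nblocks, nthreads, 0, stream>>>(quit_flag);
	return (int)cudaGetLastError();
}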

To address the data consistency issue, you can use the rte_gpu_wmb() function before notifying the running CUDA kernel that a new set of packets is ready. You can find an example here.
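
A minimal sketch of the ordering pattern (the struct and flag names below are mine; l2fwd-nv itself relies on the gpudev communication-list helpers for this, if I recall correctly): write the burst metadata into GPU-visible memory, call rte_gpu_wmb() so those writes are ordered before the notification, and only then set the flag the kernel is polling.

#include <stdint.h>
#include <rte_gpudev.h>

/* Hypothetical descriptor living in memory visible to the persistent kernel. */
struct burst_entry {
	uintptr_t addr;   /* packet data address in GPU memory */
	uint32_t  len;    /* total burst length in bytes */
	uint32_t  ready;  /* flag polled by the CUDA kernel */
};

static void
notify_kernel(int16_t gpu_dev_id, struct burst_entry *entry,
	      uintptr_t pkt_addr, uint32_t pkt_len)
{
	/* 1. Publish the metadata for the new burst. */
	entry->addr = pkt_addr;
	entry->len  = pkt_len;

	/* 2. Ensure the writes above are visible to the GPU before the flag update. */
	rte_gpu_wmb(gpu_dev_id);

	/* 3. Notify the persistent kernel that the burst is ready to process. */
	entry->ready = 1;
}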

I have the same question about the A4000. I modified the cuda.c driver to include the A4000 device ID, but I'm getting a crash in the mlx5 driver (I'm using a ConnectX-4):

#0 0x00005555563b35f7 in mlx5_tx_burst_mti ()
#1 0x0000555556233c09 in rte_eth_tx_burst (nb_pkts=, tx_pkts=, queue_id=2, port_id=0)
at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:570
#2 tx_core (arg=0x2) at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:584
#3 0x0000555555763f04 in eal_thread_loop.cold () at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:757
#4 0x00007ffff7f16609 in start_thread (arg=) at pthread_create.c:477
#5 0x00007ffff7a85133 in clone () at …/sysdeps/unix/sysv/linux/x86_64/clone.S:95

ConnectX-4 is an old card and it's been a while since I tested this solution on it. Can you build DPDK in debug mode and provide more info about the problematic line in the mlx5_tx_burst_mti() function?

Re-ran with debug enabled in the DPDK lib (below).

I ordered a ConnectX-5, which should arrive next week or so, i.e., hopefully a newer board.

Thread 9 “lcore-worker-5” received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7fff000 (LWP 527851)]
0x00005555567a5be4 in mlx5_tx_eseg_data (olx=83, tso=0, inlen=18, vlan=0, wqe=0x7ff7ff9a3000,
loc=0x7fffd7ff95d0, txq=0x7ff7ff9e4380) at …/drivers/net/mlx5/mlx5_tx.h:1016
1016 es->inline_data = *(unaligned_uint16_t *)psrc;
(gdb) where
#0 0x00005555567a5be4 in mlx5_tx_eseg_data (olx=83, tso=0, inlen=18, vlan=0,
wqe=0x7ff7ff9a3000, loc=0x7fffd7ff95d0, txq=0x7ff7ff9e4380)
at …/drivers/net/mlx5/mlx5_tx.h:1016
#1 mlx5_tx_burst_single_send (olx=83, loc=0x7fffd7ff95d0, pkts_n=64, pkts=0x7ff7f34a8b48,
txq=0x7ff7ff9e4380) at …/drivers/net/mlx5/mlx5_tx.h:3222
#2 mlx5_tx_burst_single (olx=83, loc=0x7fffd7ff95d0, pkts_n=64, pkts=0x7ff7f34a8b40,
txq=0x7ff7ff9e4380) at …/drivers/net/mlx5/mlx5_tx.h:3366
#3 mlx5_tx_burst_tmpl (olx=83, pkts_n=64, pkts=0x7ff7f34a8b40, txq=0x7ff7ff9e4380)
at …/drivers/net/mlx5/mlx5_tx.h:3564
#4 mlx5_tx_burst_mti (txq=0x7ff7ff9e4380, pkts=0x7ff7f34a8b40, pkts_n=64)
at …/drivers/net/mlx5/mlx5_tx_nompw.c:27
#5 0x00005555556f99e9 in rte_eth_tx_burst (nb_pkts=, tx_pkts=0x7ff7f34a8b40,
queue_id=2, port_id=0) at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:570
#6 tx_core (arg=0x2) at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:588
#7 0x000055555aa8926f in eal_thread_loop (arg=0x0) at …/lib/eal/linux/eal_thread.c:140
#8 0x00007ffff7f16609 in start_thread (arg=) at pthread_create.c:477
#9 0x00007ffff7a78133 in clone () at …/sysdeps/unix/sysv/linux/x86_64/clone.S:95

Did you disable the TX inlining when launching l2fwd-nv? As an example: -a b5:00.1,txq_inline_max=0

Yes:

$L2FWD/build/l2fwdnv -l 0-9 -n 1 -a 06:00.0,txq_inline_max=0 -a 01:00.0 -- -m 1 -w 0 -b 64 -p 4 -v 0 -z 0

Interestingly, if I reduce the number of cores and the number of pipelines, the app doesn't segfault and seems to behave. This might be a learning-curve issue on my end; sorry for the false alarm.

Thanks for the reply. I was rather wondering how I could best ensure consistency with pure RDMA (and not DPDK). Can I simply issue a local RDMA write from the CPU to the GPU after the CPU has detected the new RDMA message on the GPU (e.g., through an RDMA completion event)? When the GPU detects the flag, is it then ensured that the original RDMA message is completely consistent on the GPU, even for concurrently running GPU kernels?

Great blog post and very good sample code in l2fwd-nv with GPU. Do you have it working with DPDK/DOCA on a converged DPU with a GPU? My version of the DPU uses a different version of DPDK, so I was not sure whether all the dependencies would be satisfied or whether an updated version of DPDK will make it to the converged DPU soon enough.

The DPDK you find on the DPU already has the gpudev library installed, so you can just use it. Anyway, if you want your own DPDK version with gpudev, you can download upstream DPDK from GitHub and build it for arm64 on your DPU.

Thank you for this great article. By the way, I have a question about your post. From Figure 13, the peak I/O throughput for the CPU and the GPU are the same. In my understanding, using CPU memory means vanilla DPDK, right? If so, AFAIK a ConnectX-6 Dx with DPDK can also achieve line rate at 64-byte packets. But from your post, the throughput for 64-byte packets is just below 20 Gbps… why is that?

Below is the DPDK performance report for the ConnectX-6 Dx.
https://fast.dpdk.org/doc/perf/DPDK_20_11_Mellanox_NIC_performance_report.pdf

Hello, may I ask why the l2fwd-nv project uses 60-byte packets as the segmentation size for the data? Is there any basis for this?