Optimizing Inline Packet Processing Using DPDK and GPUdev with GPUs

Originally published at: https://developer.nvidia.com/blog/optimizing-inline-packet-processing-using-dpdk-and-gpudev-with-gpus/

Inline processing of network packets using GPUs is a packet analysis technique useful to a number of different applications.

This was a really interesting article. Looking at the DPDK links, I noticed that the only cards the driver supports are V100/A100 Tesla-class GPUs. Is there any intention to change this in the future to include lower-spec Quadro cards? Also, is this technique specific to InfiniBand, or can it be used for vanilla Ethernet too?

I recently extended the support to more GPUs (dpdk/devices.h at main · DPDK/dpdk · GitHub). If your Tesla or Quadro GPU is not there, please let me know and I will add it.
You can use whatever card supports GPUDirect RDMA to receive packets in GPU memory, but so far this solution has been tested with ConnectX cards only.
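
For reference, adding a board to DPDK's gpu/cuda driver essentially means adding its PCI device ID to the driver's PCI match table. Below is a minimal sketch of what such an entry could look like; the macro names, the header, and the 0x24b0 RTX A4000 device ID are my assumptions, so check drivers/gpu/cuda/devices.h and cuda.c in your DPDK tree and verify the ID with lspci -nn.

/* Hypothetical sketch of a new entry in the cuda driver's PCI match table. */
#include <rte_bus_pci.h>    /* struct rte_pci_id, RTE_PCI_DEVICE() */

#define NVIDIA_GPU_VENDOR_ID        (0x10de)  /* NVIDIA PCI vendor ID */
#define NVIDIA_GPU_A4000_DEVICE_ID  (0x24b0)  /* RTX A4000; assumed, verify with lspci -nn */

static const struct rte_pci_id cuda_pci_ids[] = {
	/* ... existing Tesla/Quadro entries ... */
	{
		RTE_PCI_DEVICE(NVIDIA_GPU_VENDOR_ID, NVIDIA_GPU_A4000_DEVICE_ID)
	},
	{
		.vendor_id = 0 /* sentinel marking the end of the table */
	}
};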

Great - my card (A4000) is there now - thanks! I'm intending to test this with a ConnectX-5 NIC to process Ethernet packets. One more question: is a multi-GPU setup required for this? I am assuming it is a requirement for persistent kernels, but how about for method 4? I would like to try a dev setup with the A4000 card acting as both my display device and a packet processor.

Thank you for an interesting article. How is the data consistency issue (as described in [1]) resolved for the persistent kernel (method 3)? As I understand it, GDRCopy translates to RDMA operations, but AFAIK ordering between RDMA ops isn't ensured from the perspective of a concurrently running GPU kernel.

[1] GPUDirect RDMA :: CUDA Toolkit Documentation

No, you don't need a multi-GPU setup with any of the methods described in the post. As an example, in the case of the persistent kernel you need to tune the number of CUDA blocks (that is, the persistent kernel occupancy) so that it does not occupy the entire GPU and SMs remain available for other processing kernels.
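
As a minimal sketch of that tuning (the kernel, flag, and block/thread counts below are hypothetical, not the l2fwd-nv code), the idea is to launch the persistent kernel on only a fraction of the SMs so the rest of the GPU stays free for other kernels:

#include <cuda_runtime.h>
#include <stdint.h>

__global__ void persistent_packet_kernel(volatile uint32_t *quit_flag)
{
	/* Poll until the host signals shutdown; a real kernel would also poll a
	 * per-burst "ready" flag and process the packets of each burst here. */
	while (*quit_flag == 0) {
		/* ... wait for the next burst and process it ... */
	}
}

int launch_persistent_kernel(cudaStream_t stream, uint32_t *quit_flag)
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	/* Occupy only half of the SMs (one block per SM here), leaving the
	 * remaining SMs available for other processing kernels. */
	int nblocks  = prop.multiProcessorCount / 2;
	int nthreads = 512; /* threads per block, tuned to the workload */

	persistent_packet_kernel<<<nblocks, nthreads, 0, stream>>>(quit_flag);
	return (int)cudaGetLastError();
}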

To address the data consistency issue, you can use the rte_gpu_wmb() function before notifying the running CUDA kernel that a new set of packets is ready. You can find an example here.
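
A minimal sketch of the ordering pattern (the struct and flag names below are mine; l2fwd-nv itself relies on the gpudev communication-list helpers for this, if I recall correctly): write the burst metadata into GPU-visible memory, call rte_gpu_wmb() so those writes are ordered before the notification, and only then set the flag the kernel is polling.

#include <stdint.h>
#include <rte_gpudev.h>

/* Hypothetical descriptor living in memory visible to the persistent kernel. */
struct burst_entry {
	uintptr_t addr;   /* packet data address in GPU memory */
	uint32_t  len;    /* total burst length in bytes */
	uint32_t  ready;  /* flag polled by the CUDA kernel */
};

static void
notify_kernel(int16_t gpu_dev_id, struct burst_entry *entry,
	      uintptr_t pkt_addr, uint32_t pkt_len)
{
	/* 1. Publish the metadata for the new burst. */
	entry->addr = pkt_addr;
	entry->len  = pkt_len;

	/* 2. Ensure the writes above are visible to the GPU before the flag update. */
	rte_gpu_wmb(gpu_dev_id);

	/* 3. Notify the persistent kernel that the burst is ready to process. */
	entry->ready = 1;
}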

I have the same question about the A4000. I modified the cuda.c driver to include the A4000 device ID, but I'm getting a crash in the mlx5 driver (I'm using a ConnectX-4):

#0 0x00005555563b35f7 in mlx5_tx_burst_mti ()
#1 0x0000555556233c09 in rte_eth_tx_burst (nb_pkts=, tx_pkts=, queue_id=2, port_id=0)
at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:570
#2 tx_core (arg=0x2) at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:584
#3 0x0000555555763f04 in eal_thread_loop.cold () at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:757
#4 0x00007ffff7f16609 in start_thread (arg=) at pthread_create.c:477
#5 0x00007ffff7a85133 in clone () at …/sysdeps/unix/sysv/linux/x86_64/clone.S:95

ConnectX-4 is an old card and it's been a while since I tested this solution on it. Can you build DPDK in debug mode and provide more info about the problematic line in the mlx5_tx_burst_mti() function?

Re-ran with debug enabled in the DPDK lib (below).

I ordered a ConnectX-5, which should arrive next week or so, i.e., hopefully a newer board.

Thread 9 “lcore-worker-5” received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7fff000 (LWP 527851)]
0x00005555567a5be4 in mlx5_tx_eseg_data (olx=83, tso=0, inlen=18, vlan=0, wqe=0x7ff7ff9a3000,
loc=0x7fffd7ff95d0, txq=0x7ff7ff9e4380) at …/drivers/net/mlx5/mlx5_tx.h:1016
1016 es->inline_data = *(unaligned_uint16_t *)psrc;
(gdb) where
#0 0x00005555567a5be4 in mlx5_tx_eseg_data (olx=83, tso=0, inlen=18, vlan=0,
wqe=0x7ff7ff9a3000, loc=0x7fffd7ff95d0, txq=0x7ff7ff9e4380)
at …/drivers/net/mlx5/mlx5_tx.h:1016
#1 mlx5_tx_burst_single_send (olx=83, loc=0x7fffd7ff95d0, pkts_n=64, pkts=0x7ff7f34a8b48,
txq=0x7ff7ff9e4380) at …/drivers/net/mlx5/mlx5_tx.h:3222
#2 mlx5_tx_burst_single (olx=83, loc=0x7fffd7ff95d0, pkts_n=64, pkts=0x7ff7f34a8b40,
txq=0x7ff7ff9e4380) at …/drivers/net/mlx5/mlx5_tx.h:3366
#3 mlx5_tx_burst_tmpl (olx=83, pkts_n=64, pkts=0x7ff7f34a8b40, txq=0x7ff7ff9e4380)
at …/drivers/net/mlx5/mlx5_tx.h:3564
#4 mlx5_tx_burst_mti (txq=0x7ff7ff9e4380, pkts=0x7ff7f34a8b40, pkts_n=64)
at …/drivers/net/mlx5/mlx5_tx_nompw.c:27
#5 0x00005555556f99e9 in rte_eth_tx_burst (nb_pkts=, tx_pkts=0x7ff7f34a8b40,
queue_id=2, port_id=0) at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:570
#6 tx_core (arg=0x2) at /home/rick/NVIDIA/l2fwd-nv/src/main.cpp:588
#7 0x000055555aa8926f in eal_thread_loop (arg=0x0) at …/lib/eal/linux/eal_thread.c:140
#8 0x00007ffff7f16609 in start_thread (arg=) at pthread_create.c:477
#9 0x00007ffff7a78133 in clone () at …/sysdeps/unix/sysv/linux/x86_64/clone.S:95

Did you disable the TX inlining when launching l2fwd-nv? As an example: -a b5:00.1,txq_inline_max=0

Yes:

$L2FWD/build/l2fwdnv -l 0-9 -n 1 -a 06:00.0,txq_inline_max=0 -a 01:00.0 -- -m 1 -w 0 -b 64 -p 4 -v 0 -z 0

Interestingly, if I reduce the number of cores and the number of pipelines, the app doesn't segfault and seems to behave. This might be a learning-curve issue on my end; sorry for the false alarm.

Thanks for the reply. I was rather wondering how I could best ensure consistency with pure RDMA (and not DPDK). Can I simply issue a local RDMA write from the CPU to the GPU after the CPU has detected the new RDMA message on the GPU (e.g., through an RDMA completion event)? When the GPU detects the flag, is it then ensured that the original RDMA message is completely consistent on the GPU, even for concurrently running GPU kernels?

Great blog post and very good sample code in l2fwd-nv with GPU. Do you have it working with DPDK/DOCA on a converged DPU with a GPU? My version of the DPU uses a different version of DPDK, so I was not sure whether all the dependencies would be satisfied or whether an updated version of DPDK will make it to the converged DPU soon enough.

The DPDK you find on the DPU already has the gpudev library installed, so you can just use it. Anyway, if you want your own DPDK version with gpudev, you can download upstream DPDK from GitHub and build it for arm64 on your DPU.

Thank you for this great article. By the way, I have a question about your post. From Figure 13, the peak I/O throughput for the CPU and the GPU are the same. In my understanding, using CPU memory means vanilla DPDK, right? If so, AFAIK a ConnectX-6 Dx with DPDK can also achieve line rate at 64-byte packets. But from your post, the throughput for 64-byte packets is just below 20 Gbps… why is that?

Below is the DPDK performance report for the ConnectX-6 Dx.
https://fast.dpdk.org/doc/perf/DPDK_20_11_Mellanox_NIC_performance_report.pdf

Hello, may I ask why the l2fwd-nv project uses 60-byte packets as the segmentation size for the data? Is there any basis for this?