DOCA gpu_packet_processing runtime error

Hi - I’m trying to run the gpu_packet_processing example described in the DOCA GPUNetIO Install Page.

I have an x86_64 system with an NVIDIA RTX A4500 Embedded GPU and a ConnectX-7 NIC:

 +-[0000:14]-+-00.0  Intel Corporation Device 09a2
 |           +-00.1  Intel Corporation Device 09a4
 |           +-00.2  Intel Corporation Device 09a3
 |           +-00.3  Intel Corporation Device 09a5
 |           +-00.4  Intel Corporation Device 0998
 |           \-02.0-[15-1a]----00.0-[16-1a]--+-00.0-[17]--+-00.0  Mellanox Technologies MT2910 Family [ConnectX-7]
 |                                           |            \-00.1  Mellanox Technologies MT2910 Family [ConnectX-7]
 |                                           \-01.0-[18-1a]----00.0-[19-1a]----00.0-[1a]--+-00.0  NVIDIA Corporation Device 24fa
 |                                                                                        \-00.1  NVIDIA Corporation GA104 High Definition Audio Controller

Per the guide, I’ve set up the ConnectX adapter in Ethernet mode, disabled ACS in BIOS, enabled resizeable BAR1, and set up hugepages:

# lspci -s 17:00.0 -vvvv | grep -i ACSCtl
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
# nvidia-smi -q | grep BAR1 -A 3
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
# grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       4
HugePages_Free:        3
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         4194304 kB

If I run the app, here’s what I get:

# ./doca_gpu_packet_processing -n 17:00.0 -g 1a:00.0 -q 1 -l 70 --sdk-log-level 70
[00:53:58:046419][1204][DOCA][INF][gpu_packet_processing.c:284][main] ===========================================================
[00:53:58:046473][1204][DOCA][INF][gpu_packet_processing.c:285][main] DOCA version: 2.7.0085
[00:53:58:046478][1204][DOCA][INF][gpu_packet_processing.c:286][main] ===========================================================
[00:53:58:046505][1204][DOCA][INF][gpu_packet_processing.c:307][main] Options enabled:
        GPU 1a:00.0
        NIC 17:00.0
        GPU Rx queues 1
        GPU HTTP server enabled No
[00:53:58:131521][1204][DOCA][INF][doca_dev.cpp:578][doca_devinfo_create_list] Devinfo list 0x6144d829e288: Added device=0x6144d829d3c0 to devinfo list
[00:53:58:131541][1204][DOCA][INF][doca_dev.cpp:578][doca_devinfo_create_list] Devinfo list 0x6144d829e288: Added device=0x6144d829e500 to devinfo list
[00:53:58:131547][1204][DOCA][INF][doca_dev.cpp:587][doca_devinfo_create_list] Devinfo list 0x6144d829e288 was created
[00:53:58:133411][1204][DOCA][INF][doca_dev.cpp:1003][doca_dev_open] Local device 0x6144d829d3c0 was opened
[00:53:58:133423][1204][DOCA][INF][doca_dev.cpp:146][dev_put] Device 0x6144d829e500 was destroyed
[00:53:58:133436][1204][DOCA][INF][doca_dev.cpp:668][doca_devinfo_destroy_list] Devinfo list 0x6144d829e288 was destroyed
EAL: Detected CPU lcores: 20
EAL: Detected NUMA nodes: 1
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: VFIO support initialized
TELEMETRY: No legacy callbacks, legacy socket not created
EAL: Probe PCI driver: mlx5_pci (15b3:1021) device: 0000:17:00.0 (socket 0)
EAL: Driver cannot attach the device (1a:00.0)
EAL: Failed to attach device on primary process
[00:53:58:286841][1204][DOCA][ERR][doca_gpunetio.cpp:151][doca_gpu_create] Failed to hotplug DPDK GPU device 1a:00.0 (-95)
[00:53:58:286857][1204][DOCA][ERR][gpu_packet_processing.c:332][main] Function doca_gpu_create returned Resource initialization failure

The driver doesn’t seem to be probing, but I’m not sure why… any ideas or suggestions would be helpful!

For some reason, the underlying DPDK layer does not accept this GPU. This problem will be resolved in an upcoming release.

Could you send the output of nvidia-smi and lspci -s 1a:00.0 -n?

Here is the output of nvidia-smi:

Tue Jun 18 15:14:27 2024
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A4500 Embedde...    Off |   00000000:1A:00.0 Off |                  Off |
| N/A   30C    P8              8W /   50W |       2MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |

| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|  No running processes found                                                             |

Here is the output of lspci -s 1a:00.0 -n:

1a:00.0 0300: 10de:24fa (rev a1)

This is an unfortunate miss: at the moment, DOCA GPUNetIO relies on the DPDK GPU driver for basic device initialization. DPDK keeps a list of supported GPU device IDs, and this ID (10de:24fa) is not in that list. I will remove this dependency entirely in future DOCA releases; for now, I can guarantee that this ID will be in the list in the next DOCA 2.8 release.
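To illustrate the kind of check described above, here is a minimal sketch in C of an ID-table lookup. It is not the DPDK implementation: the struct gpu_pci_id type and the parse_lspci_n and gpu_is_supported helpers are hypothetical names made up for this example (DPDK's real driver uses its own PCI ID table type), and the table only contains the handful of device IDs visible in the devices.h excerpt in this thread. The point is only that a device whose ID is absent from the table is never matched by the driver, so the probe fails. On Linux, 95 is the value of ENOTSUP/EOPNOTSUPP, which would be consistent with the (-95) in the log, although that link is an assumption here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NVIDIA_VENDOR_ID 0x10de

/* Hypothetical ID record for this sketch; not DPDK's actual type. */
struct gpu_pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Device IDs copied from the devices.h excerpt quoted in this thread.
 * The real list is longer; 0x24fa is deliberately absent, as in the
 * unpatched driver. */
static const struct gpu_pci_id supported_gpus[] = {
    { NVIDIA_VENDOR_ID, 0x2230 }, /* NVIDIA_GPU_QUADRO_RTX_A6000 */
    { NVIDIA_VENDOR_ID, 0x2231 }, /* NVIDIA_GPU_QUADRO_RTX_A5000 */
    { NVIDIA_VENDOR_ID, 0x2232 }, /* NVIDIA_GPU_QUADRO_RTX_A4500 */
    { NVIDIA_VENDOR_ID, 0x2233 }, /* NVIDIA_GPU_QUADRO_RTX_A5500 */
    { NVIDIA_VENDOR_ID, 0x2531 }, /* NVIDIA_GPU_QUADRO_RTX_A2000 */
    { NVIDIA_VENDOR_ID, 0x2571 }, /* NVIDIA_GPU_QUADRO_RTX_A2000_12GB */
};

/* Parse one line of `lspci -n` output, for example
 * "1a:00.0 0300: 10de:24fa (rev a1)", into vendor and device IDs.
 * Returns false if the line does not have the expected shape. */
static bool parse_lspci_n(const char *line, struct gpu_pci_id *out)
{
    unsigned int bus, dev, fn, class_code, vendor, device;

    if (sscanf(line, "%x:%x.%x %x: %x:%x",
               &bus, &dev, &fn, &class_code, &vendor, &device) != 6)
        return false;

    out->vendor = (uint16_t)vendor;
    out->device = (uint16_t)device;
    return true;
}

/* Linear scan of the table: a device is "supported" only if both the
 * vendor ID and the device ID match an entry exactly. */
static bool gpu_is_supported(const struct gpu_pci_id *id)
{
    size_t n = sizeof(supported_gpus) / sizeof(supported_gpus[0]);

    for (size_t i = 0; i < n; i++) {
        if (supported_gpus[i].vendor == id->vendor &&
            supported_gpus[i].device == id->device)
            return true;
    }
    return false;
}
```

Feeding the lspci line from this thread to parse_lspci_n gives vendor 0x10de and device 0x24fa, and gpu_is_supported then returns false because no table entry carries that device ID.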

Thanks for getting back to us! We look forward to the next release.

In the meantime, I attempted adding a patch to our container build that adds our device to the list of supported devices in DPDK:

--- dpdk-stable-23.11.1.orig/drivers/gpu/cuda/cuda.c    2024-05-16 23:46:11.000000000 -0700
+++ dpdk-stable-23.11.1/drivers/gpu/cuda/cuda.c 2024-06-18 11:20:56.875980541 -0700
@@ -221,6 +221,10 @@
+                               NVIDIA_GPU_QUADRO_RTX_A4500E)
+       },
+       {
diff -Naur dpdk-stable-23.11.1.orig/drivers/gpu/cuda/devices.h dpdk-stable-23.11.1/drivers/gpu/cuda/devices.h
--- dpdk-stable-23.11.1.orig/drivers/gpu/cuda/devices.h 2024-05-16 23:46:11.000000000 -0700
+++ dpdk-stable-23.11.1/drivers/gpu/cuda/devices.h      2024-06-18 11:20:42.700013196 -0700
@@ -58,6 +58,7 @@
 #define NVIDIA_GPU_QUADRO_RTX_A6000 (0x2230)
 #define NVIDIA_GPU_QUADRO_RTX_A5000 (0x2231)
 #define NVIDIA_GPU_QUADRO_RTX_A4500 (0x2232)
+#define NVIDIA_GPU_QUADRO_RTX_A4500E (0x24fa)
 #define NVIDIA_GPU_QUADRO_RTX_A5500 (0x2233)
 #define NVIDIA_GPU_QUADRO_RTX_A2000 (0x2531)
 #define NVIDIA_GPU_QUADRO_RTX_A2000_12GB (0x2571)

However, running the app still produces exactly the same error as in my original post. As far as you can tell, is there anything obviously wrong with how I’ve added the device to DPDK?

Thank you!