This might be a bit of a long shot, but I'm struggling to figure out how else to proceed with troubleshooting.
Here’s what I’m facing:
- I have an IVA container based on DeepStream 6.2. It uses a custom YOLOv8 model and consumes RTSP streams.
- The IVA is deployed via Helm onto a local, single-node K8s cluster.
- On my dev box, which runs a 3060 Ti (plenty of RAM and a good CPU), the IVA works great, e.g. 4 concurrent streams each processing at around 30 FPS.
- The dev box has the NVIDIA driver installed natively (Driver Version: 525.105.17, CUDA Version: 12.0), i.e. the driver is installed on the machine itself, not via the K8s cluster, with Cloud Native Core (v7.0) installed manually; this is NOT Cloud Native Stack. OS is Ubuntu 22.04 LTS.
- On another box, which runs an A2 (w/ even more RAM and CPU power), the same IVA deployed the same way runs at ~13 FPS for maybe 30-45 s before the FPS drops to 0.
- The big differences here are that the A2 box runs a K8s cluster set up by the Ansible playbook for Cloud Native Stack (I've tried both v9.0 and v10.0), and the NVIDIA driver is NOT installed locally, i.e. it is handled entirely by the GPU Operator (see the kubectl check just after this list). The driver version is the same, except it reports CUDA 12.1. OS is also Ubuntu 22.04 LTS.
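For reference, this is roughly how I've been verifying the operator-managed driver on the A2 box (the namespace and pod name depend on how the Cloud Native Stack playbook set things up, so adjust for your cluster):

```
# List the GPU Operator pods (namespace may be gpu-operator or
# gpu-operator-resources depending on the install)
kubectl get pods -n gpu-operator

# Run nvidia-smi inside the driver daemonset pod to confirm the
# driver/CUDA version that containers actually see
kubectl exec -n gpu-operator <nvidia-driver-daemonset-pod> -- nvidia-smi
```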
I'd have figured that the A2 box, running the latest Cloud Native Stack, would have no problem running the IVA, since it runs perfectly on the 3060 Ti. I recognize that the supporting K8s cluster is a bit different, but I can't think of (or see) a reason why the performance would differ so wildly.
Running a single stream on the A2 box works just fine, but obviously that isn’t much help.
Can anyone recommend what else I can check, or does anyone have an idea why this is happening? The A2 looks to be in good health, and the output of nvidia-smi shows the GPU running just fine.
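For context, "running just fine" is based on watching the GPU live while the streams are up, roughly like this (the -s flags select power/temp, utilization, clocks, and memory, sampled every second):

```
# Poll GPU metrics once per second while the 4 streams are running;
# a sudden clock or utilization drop at the moment FPS hits 0 would
# point at throttling or a stalled pipeline
nvidia-smi dmon -s pucm -d 1
```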
Thank you in advance and I apologize for the fairly vague topic.
ETA: The component latencies on the dev box running the 3060 Ti are an order of magnitude lower than on the A2 box. Same container.
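For anyone wondering how I'm getting the component latencies: I'm using DeepStream's built-in latency measurement, enabled via environment variables on the container, roughly like this:

```
# Enable DeepStream's frame and per-component latency logging
# (reported by the app via the nvds latency measurement APIs)
export NVDS_ENABLE_LATENCY_MEASUREMENT=1
export NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1
```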