I’m having unexpected performance problems with a concurrent CUDA workload on a multi-GPU system (CUDA 11.8). The workload consists of two TensorRT engines per GPU (each engine has its own CPU worker thread) and a number of CPU worker threads calling into NPP to resize batches of frames.
I’ve followed all the guidelines for getting optimal concurrency:

- Every worker has its own dedicated CUDA stream (set via nppSetStream and through TensorRT), and I never synchronize the entire device.
- I use async operations such as cudaMemcpyAsync wherever possible.
- No null-stream usage at all.
- For resizing, I use many worker threads to push as much data to the GPU as possible. I used to do the same for TensorRT, but that causes issues: the TensorRT executions interleave and latency increases linearly with the number of worker threads, which delays the workers and eventually stalls the entire system.
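For reference, this is roughly what each resize worker does. It’s a minimal sketch: the function name, buffer sizes and batch loop are made up for illustration and error checking is omitted, but the stream handling matches what I described above.

```cpp
// Sketch of one resize worker thread (hypothetical names/sizes, no error checks).
// Each worker creates its own non-blocking stream, registers it with NPP via
// nppSetStream, and only ever synchronizes on that stream.
#include <cuda_runtime.h>
#include <npp.h>

void resizeWorker(int device, const Npp8u* hostSrc, Npp8u* hostDst,
                  NppiSize srcSize, NppiSize dstSize, int numBatches)
{
    cudaSetDevice(device);

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    nppSetStream(stream);                      // route NPP work to this stream

    const int srcStep = srcSize.width * 3;     // packed 8-bit RGB, assumed
    const int dstStep = dstSize.width * 3;

    Npp8u *dSrc = nullptr, *dDst = nullptr;    // simplified: one buffer per worker
    cudaMalloc(&dSrc, size_t(srcStep) * srcSize.height);
    cudaMalloc(&dDst, size_t(dstStep) * dstSize.height);

    NppiRect srcRoi{0, 0, srcSize.width, srcSize.height};
    NppiRect dstRoi{0, 0, dstSize.width, dstSize.height};

    for (int i = 0; i < numBatches; ++i) {
        // hostSrc/hostDst assumed pinned (cudaHostAlloc) so the copies are truly async
        cudaMemcpyAsync(dSrc, hostSrc, size_t(srcStep) * srcSize.height,
                        cudaMemcpyHostToDevice, stream);
        nppiResize_8u_C3R(dSrc, srcStep, srcSize, srcRoi,
                          dDst, dstStep, dstSize, dstRoi, NPPI_INTER_LINEAR);
        cudaMemcpyAsync(hostDst, dDst, size_t(dstStep) * dstSize.height,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);         // only this stream, never the device
    }

    cudaFree(dSrc);
    cudaFree(dDst);
    cudaStreamDestroy(stream);
}
```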
What I’m seeing when profiling is that CUDA calls (any CUDA call) can take seconds to complete, even simple ones such as cudaFree. Some worker threads are completely starved (deadlocked?) and never even get started, i.e. their TensorRT engine never loads because the underlying CUDA calls are blocked.
This is how it looks in Nsight:
This shows two concurrent resize operations that take more than 2 seconds (!) due to being locked internally:
Here’s a couple of allocations that take many seconds to complete (again due to internal locking):
Here’s a TensorRT execution that is being interleaved with all kinds of other operations and hardly making progress:
Meanwhile, the GPUs seem not to be busy at all:
GPU utilization in nvidia-smi hardly goes above 50% during the entire run…
I’m trying to understand what shared resource CUDA holds internally that needs to be synchronized on every call and could be causing this contention. As far as I know, enqueueing async operations should not involve any locking at all, especially when each worker runs on its own separate stream. Hopefully someone can help me understand what exactly CUDA is doing here internally, and how I can actually saturate the GPUs!
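One experiment I’m considering, to check whether the device memory allocator is the shared resource being locked: moving the per-batch allocations from cudaMalloc/cudaFree to the stream-ordered allocator (cudaMallocAsync/cudaFreeAsync, available since CUDA 11.2), in case cudaFree is synchronizing on more than just my stream. Rough sketch, with a hypothetical processBatch helper:

```cpp
// Hypothetical probe: allocate/free per batch through the stream-ordered
// allocator so that the (de)allocation is ordered on this worker's own stream.
#include <cuda_runtime.h>

void processBatch(cudaStream_t stream, const void* hostSrc, size_t bytes)
{
    void* dBuf = nullptr;
    cudaMallocAsync(&dBuf, bytes, stream);                   // CUDA 11.2+
    cudaMemcpyAsync(dBuf, hostSrc, bytes, cudaMemcpyHostToDevice, stream);
    // ... enqueue the resize / inference work on `stream` here ...
    cudaFreeAsync(dBuf, stream);                             // ordered on this stream
    cudaStreamSynchronize(stream);                           // this stream only
}
```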
EDIT: I wasn’t sure before, but after going through the profiler output again I’m seeing operations on different devices block until some other thread, on a different stream and a different device, releases a lock. Could it be that CUDA or the driver has some kind of global locked resource?