Multi-GPU contention inside CUDA

I’m having unexpected performance problems with a concurrent CUDA workload on a multi-GPU system (CUDA 11.8). The workload consists of two TensorRT engines per GPU (each TensorRT engine has its own worker thread on the CPU) and a bunch of CPU worker threads calling into NPP to resize batches of frames.

I’ve followed all the usual guidelines for getting good concurrency:

- Every worker has its own dedicated CUDA stream (via nppSetStream for NPP, and the stream passed to TensorRT’s enqueue call), and I never synchronize the entire device.
- I use async operations such as cudaMemcpyAsync wherever possible, and I never touch the null stream.
- For resizing, I use many worker threads to push as much data to the GPU as possible. I used to do the same for TensorRT, but that caused problems: the TensorRT executions interleave, latency grows linearly with the number of worker threads, and the resulting delays eventually stall the entire system.
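
To make the setup concrete, here’s roughly what each resize worker does. This is a stripped-down sketch, not my actual code: the frame format, the dimensions, and the use of the NppStreamContext/_Ctx variant (instead of nppSetStream, which is what I currently call) are just for illustration, and I’m assuming pinned host buffers.

```cpp
// Minimal sketch of one resize worker (not the actual code).
// Assumptions: 8-bit 3-channel frames, nppiResize_8u_C3R_Ctx as a
// representative NPP call, placeholder dimensions, pinned host buffers.
#include <cuda_runtime.h>
#include <npp.h>

void resizeWorker(const Npp8u* hostSrc, Npp8u* hostDst)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Bind NPP to this worker's stream via a stream context.
    NppStreamContext ctx{};
    nppGetStreamContext(&ctx);
    ctx.hStream = stream;

    const NppiSize srcSize{1920, 1080};   // placeholder dimensions
    const NppiSize dstSize{640, 640};
    const NppiRect srcRoi{0, 0, srcSize.width, srcSize.height};
    const NppiRect dstRoi{0, 0, dstSize.width, dstSize.height};
    const int srcStep = srcSize.width * 3;
    const int dstStep = dstSize.width * 3;

    Npp8u *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc(&dSrc, srcStep * srcSize.height);   // per-worker device buffers
    cudaMalloc(&dDst, dstStep * dstSize.height);

    // Async H2D copy, resize, async D2H copy -- all on this worker's stream.
    cudaMemcpyAsync(dSrc, hostSrc, srcStep * srcSize.height,
                    cudaMemcpyHostToDevice, stream);
    nppiResize_8u_C3R_Ctx(dSrc, srcStep, srcSize, srcRoi,
                          dDst, dstStep, dstSize, dstRoi,
                          NPPI_INTER_LINEAR, ctx);
    cudaMemcpyAsync(hostDst, dDst, dstStep * dstSize.height,
                    cudaMemcpyDeviceToHost, stream);

    // Only this stream is synchronized, never the whole device.
    cudaStreamSynchronize(stream);

    cudaFree(dSrc);
    cudaFree(dDst);
    cudaStreamDestroy(stream);
}
```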

What I’m seeing when profiling is CUDA calls (any CUDA call) taking seconds to complete, even simple ones such as cudaFree. Some worker threads are completely starved (deadlocked?) and never even start, i.e. the TensorRT engine never loads because the underlying CUDA calls are stuck on a lock.

This is how it looks in Nsight:

This shows two concurrent resize operations that take more than 2 seconds (!) because they are locked internally:

[Nsight timeline screenshot]

Here’s a couple of allocations that take many seconds to complete (again due to internal locking):

Here’s a TensorRT execution that is being interleaved with all kinds of other operations and hardly making progress:

[Nsight timeline screenshot]

Meanwhile, the GPUs seem not to be busy at all:

[Nsight timeline screenshot]

GPU utilization in nvidia-smi hardly goes above 50% during the entire run…

I’m trying to understand what shared resource CUDA synchronizes on internally that could be causing these contention issues. As far as I know, scheduling async operations shouldn’t involve any locking at all, especially not when each worker runs on its own separate stream. Hopefully someone can help me understand what exactly CUDA is doing here internally, and how I can saturate the GPUs!

EDIT: I wasn’t sure before, but after going through the profiler output again I’m seeing operations on different devices blocking until some other thread, on a different stream and a different device, releases a lock. Could it be that CUDA or the driver has some kind of globally locked resource?

I may not be up-to-date on the latest recommendations, but traditionally it was a best practice for optimal performance to use one thread to communicate with the GPU, and have all other threads communicate with this service thread. Accessing a single shared resource (like the GPU) from multiple threads needs a “conflict resolution mechanism”, usually a lock, with associated overhead for locking and unlocking.

As I said, I may not be up to date but you may wish to double-check the chosen approach versus official recommendations.
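
In code, the idea looks roughly like this (just a rough sketch of the pattern, untested, with the class and method names made up for illustration): worker threads enqueue jobs, and a single service thread issues all CUDA calls.

```cpp
// Rough sketch of the "single GPU service thread" pattern:
// producer threads enqueue jobs, one thread makes all CUDA calls.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class GpuService {
public:
    GpuService() : worker_([this] { run(); }) {}

    ~GpuService() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Called from any CPU thread; the job itself runs on the service thread.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        // All CUDA work (copies, NPP calls, TensorRT enqueues) happens here,
        // so the driver is only ever entered from this one thread.
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the queue exists before run() starts
};
```

The other threads then call submit() with a lambda that does the copy and launch, and only the service thread ever touches the CUDA runtime.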

That’s interesting. I’ve changed my code to do all the preprocessing (NPP) on one GPU and the TensorRT inference on another. The resize operations now show much less contention (still using ~160 workers to push data), but the TensorRT executions on the other GPU still block for seconds from time to time! (Note that I’m only running 2 TensorRT workers, and they don’t seem to be waiting on each other.)
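
For reference, the partitioning is now done per worker thread, roughly like this (just a sketch; the device ordinals 0/1 are examples, and the worker bodies are elided):

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Sketch of how the work is now partitioned across devices.
void preprocWorker() {
    cudaSetDevice(0);   // all NPP resize work on GPU 0
    // ... create a dedicated stream, run the resize loop ...
}

void tensorrtWorker() {
    cudaSetDevice(1);   // both TensorRT engines on GPU 1
    // ... load the engine, enqueue inference on a dedicated stream ...
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 160; ++i)   // ~160 resize workers
        threads.emplace_back(preprocWorker);
    for (int i = 0; i < 2; ++i)     // 2 TensorRT workers
        threads.emplace_back(tensorrtWorker);
    for (auto& t : threads) t.join();
}
```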