Multi-GPU contention inside CUDA

gerwin · January 10, 2023, 6:59pm

I’m having unexpected performance problems with a concurrent CUDA workload on a multi-GPU system (CUDA 11.8). The workload consists of two TensorRT engines per GPU (each TensorRT engine has its own worker thread on the CPU) and a bunch of CPU worker threads calling into NPP to resize batches of frames.

I’ve followed the usual guidelines for getting optimal concurrency: every worker has its own dedicated CUDA stream (set via nppSetStream for NPP, and likewise for TensorRT), and I never synchronize the entire device. Where possible, I use async operations such as cudaMemcpyAsync. The null stream is not used at all. For resizing, I’m using many worker threads to push as much data to the GPU as possible. I used to do the same for TensorRT, but that caused problems: the TensorRT executions interleave, and latency increases linearly with the number of worker threads. The resulting delays in the worker threads eventually make the entire system stall.

What I’m seeing when profiling is CUDA calls (any CUDA call) that take seconds to complete, even simple ones such as cudaFree. Some worker threads are completely starved (deadlocked?) and never even start, i.e. a TensorRT engine never loads because the underlying CUDA calls are blocked.
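A rough way to cross-check the profiler timeline from inside the application is to time each host-side call on the thread that issues it and count the ones that exceed a threshold. The sketch below is only an illustration: `timed_call_ms` and `CallStats` are hypothetical helper names that do not come from CUDA, NPP, or TensorRT, and the wrapped callable stands in for whatever call is under suspicion (for example a lambda around cudaFree).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <utility>

// Hypothetical helper: runs fn() on the calling thread and returns the
// elapsed wall-clock time in milliseconds.
template <typename Fn>
double timed_call_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: thread-safe record of the worst observed latency and
// of how many calls took at least slow_threshold_ms.
class CallStats {
public:
    void record(double ms, double slow_threshold_ms) {
        assert(slow_threshold_ms >= 0.0);
        std::lock_guard<std::mutex> lock(mutex_);
        if (ms > worst_ms_) worst_ms_ = ms;
        if (ms >= slow_threshold_ms) ++slow_calls_;
    }

    double worst_ms() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return worst_ms_;
    }

    std::size_t slow_calls() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return slow_calls_;
    }

private:
    mutable std::mutex mutex_;
    double worst_ms_ = 0.0;
    std::size_t slow_calls_ = 0;
};
```

Each worker would wrap its calls, e.g. `stats.record(timed_call_ms([&] { /* CUDA call here */ }), 1000.0);`. This measures only how long the host thread was held up inside the call; it says nothing about what the GPU was doing during that time.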

This is how it looks in Nsight:

This shows two concurrent resize operations that take more than 2 seconds (!) due to being locked internally:

Here are a couple of allocations that take many seconds to complete (again due to internal locking):

Here’s a TensorRT execution that is being interleaved with all kinds of other operations and hardly making progress:

Meanwhile, the GPUs seem not to be busy at all:

GPU utilization in nvidia-smi hardly goes above 50% during the entire run…

I’m trying to understand what shared resource CUDA has internally that needs to be synchronized every time, and could be causing these contention issues. As far as I know, scheduling async operations should not involve any locking at all, especially not when running on their own separate stream. Hopefully someone can help me understand what exactly CUDA is doing here internally, and how I can saturate the GPU device!

EDIT: I wasn’t sure before, but after going through the profiler output again, I’m seeing operations on different devices blocking until some other thread, on a different stream and a different device, releases a lock. Could it be that CUDA or the driver has some kind of globally locked resource?

njuffa · January 10, 2023, 7:55pm

I may not be up-to-date on the latest recommendations, but traditionally it was a best practice for optimal performance to use one thread to communicate with the GPU, and have all other threads communicate with this service thread. Accessing a single shared resource (like the GPU) from multiple threads needs a “conflict resolution mechanism”, usually a lock, with associated overhead for locking and unlocking.

As I said, I may not be up to date but you may wish to double-check the chosen approach versus official recommendations.
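
The single-service-thread pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not an official recommendation: `GpuServiceThread` and `submit` are hypothetical names, and the jobs are plain callables that, in a real application, would contain the GPU-facing calls (stream operations, NPP or TensorRT calls). The point is only that many CPU workers hand work to one thread, so that a single thread talks to the GPU.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one thread owns all GPU-facing calls; other threads
// only enqueue jobs and wait on the returned futures.
class GpuServiceThread {
public:
    GpuServiceThread() : worker_([this] { run(); }) {}

    ~GpuServiceThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining jobs are drained before the thread exits
    }

    GpuServiceThread(const GpuServiceThread&) = delete;
    GpuServiceThread& operator=(const GpuServiceThread&) = delete;

    // Enqueue a job. The future becomes ready once the service thread has
    // finished running it.
    std::future<void> submit(std::function<void()> job) {
        assert(job && "job must be callable");
        std::packaged_task<void()> task(std::move(job));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        cv_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to run
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task();  // the GPU-facing calls would be issued here, outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

A CPU worker would call something like `service.submit([&] { /* enqueue resize on a stream */ }).get();`. The queue lock is held only while pushing or popping a job, so contention between CPU workers is limited to that short critical section. Whether this actually removes the stalls reported in this thread is untested here.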

gerwin · January 11, 2023, 1:22pm

That’s interesting. I’ve changed my code to do all the preprocessing (NPP) stuff on one GPU, and the TensorRT stuff on another. Now, the resizing operations have much less contention (still using ~160 workers to push data), but the TensorRT executions on the other GPU still block for seconds from time to time! (note that I’m only using 2 workers for TensorRT, and they don’t seem to be waiting for each other).
