TensorRT unnecessary synchronization in multi-GPU system

Description

In a multi-GPU system, when using multiple TensorRT engines over multiple threads (one thread per GPU, with one engine each) performance drops significantly due to some synchronization being done inside TensorRT that should not be necessary. The exact same workload when split out over multiple processes (one process per GPU, one engine each) is much faster.

This behavior is problematic since the only way to get good performance out of TensorRT in a multi-GPU system is by having a multi-process architecture, which suffers from IPC overhead. Mainly, I’m wondering exactly what TensorRT is locking on in the single-process case. Since each thread gets its own GPU, there shouldn’t be any locking necessary as far as I understand.

Environment

TensorRT Version: 8.5.2.2
GPU Type: 10x RTX A4000
Nvidia Driver Version: 525.60
CUDA Version: 11.8
CUDNN Version: 8.7.0
Operating System + Version: Ubuntu 22.04.1 LTS
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

Nsight profiling excerpt when using one process per GPU:


Runtime here is between 10 and 20 milliseconds.

Same profiling excerpt when using one thread per GPU (all in the same process):

Runtime varies a lot, anywhere between 200 and 1000 milliseconds.

Steps To Reproduce

Bad performance case

  1. In a multi-GPU system with N GPUs, start N threads.
  2. Bind each thread to the corresponding GPU.
  3. Load the TensorRT engine in each thread.
  4. Run the model in a loop.

Note the average runtime.
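For reference, the threaded setup boils down to something like this (a minimal sketch of steps 1 to 4 above; engine loading and the inference loop are only stubbed out):

#include <cuda_runtime.h>

#include <thread>
#include <vector>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<std::thread> workers;
    for (int device = 0; device < deviceCount; ++device)
    {
        workers.emplace_back([device]() {
            cudaSetDevice(device); // bind this thread to its own GPU
            // ... deserialize the TensorRT engine for this device (omitted) ...
            // ... run inference in a loop and record the average runtime (omitted) ...
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }
    return 0;
}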

Good performance case

  1. In a multi-GPU system with N GPUs, start N processes.
  2. Bind each process to the corresponding GPU and set CUDA_VISIBLE_DEVICES accordingly.
  3. Load the TensorRT engine in each process.
  4. Run the model in a loop.

Note the average runtime. It will be much lower than in the previous case, even though the only difference is using processes instead of threads.

EDIT: Added repro below!

For reproducibility purposes, I’ve modified sampleOnnxMNIST and was able to see the same behavior.

(Fixed this post after the initial repro wasn’t good!)

Modified sampleOnnxMNIST

Find attached the changed sampleOnnxMNIST.cpp. I made the following changes:

  • Instead of running the MNIST model just once, this version runs it 1000 times to warm up and then another 1000 times.
  • The second run is timed and the average inference duration is calculated.
  • Instead of binding to the default device, this modified version creates a thread for each device and launches inference on all devices at the same time to get maximum overlap between the threads’ computations; the core of the worker loop is sketched below.
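The core of each worker thread looks roughly like the following (a simplified sketch, not the attached file verbatim; runWorker and the bindings parameter are illustrative, and error handling is omitted):

#include <NvInfer.h>
#include <cuda_runtime.h>

#include <chrono>
#include <iostream>

// Per-worker loop: assumes `context` was created from an engine deserialized
// on this thread's device and `bindings` holds the engine's device pointers.
void runWorker(int device, nvinfer1::IExecutionContext* context, void** bindings)
{
    cudaSetDevice(device);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up: 1000 untimed iterations.
    for (int i = 0; i < 1000; ++i)
    {
        context->enqueueV2(bindings, stream, nullptr);
    }
    cudaStreamSynchronize(stream);

    // Timed run: only the enqueue calls themselves are measured (just enqueue).
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i)
    {
        context->enqueueV2(bindings, stream, nullptr);
    }
    auto stop = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);

    double totalUs = std::chrono::duration<double, std::micro>(stop - start).count();
    std::cout << "worker: " << device
              << " - mean infer runtime (microseconds): " << totalUs / 1000.0 << std::endl;

    cudaStreamDestroy(stream);
}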

Testing

No concurrency at all

To test the expected infer time without any multi-GPU usage, use the sample like this:

CUDA_VISIBLE_DEVICES=0 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist

This starts a single worker and runs the model 1000 times. The result:

LAUNCH!
worker: 0 - mean infer runtime (microseconds): 45.881
DONE

Multi-GPU, threading

On a multi-GPU system, run the modified sample like this:

CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist

On my machine (10x RTX A4000), the result is:

LAUNCH!
worker: 9 - mean infer runtime (microseconds): 567.273
worker: 2 - mean infer runtime (microseconds): 725.276
worker: 6 - mean infer runtime (microseconds): 790.605
worker: 3 - mean infer runtime (microseconds): 828.389
worker: 8 - mean infer runtime (microseconds): 834.065
worker: 4 - mean infer runtime (microseconds): 835.557
worker: 5 - mean infer runtime (microseconds): 832.867
worker: 0 - mean infer runtime (microseconds): 826.183
worker: 7 - mean infer runtime (microseconds): 822.379
worker: 1 - mean infer runtime (microseconds): 816.8
DONE

Note that the average runtime is about 20x worse.

You can see the locking behavior in the trace:

Note the constant calls to pthread_mutex_lock from within TensorRT and the fact that the GPUs are NOT saturated at all.

Multi-GPU, multi-process

Run the sample like this to simulate one process per GPU:

CUDA_VISIBLE_DEVICES=0 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=1 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=2 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=3 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=4 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=5 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=6 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=7 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=8 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist & \
CUDA_VISIBLE_DEVICES=9 CUDA_MODULE_LOADING=LAZY ./sample_onnx_mnist

Results:

LAUNCH!
worker: 0 - mean infer runtime (microseconds): 111.138
DONE
LAUNCH!
LAUNCH!
LAUNCH!
worker: 0 - mean infer runtime (microseconds): 65.64
DONE
worker: 0 - mean infer runtime (microseconds): 51.071
DONE
worker: 0 - mean infer runtime (microseconds): 69.823
DONE
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
worker: 0 - mean infer runtime (microseconds): 70.371
DONE
worker: 0 - mean infer runtime (microseconds): 69.557
DONE
worker: 0 - mean infer runtime (microseconds): 69.252
DONE
worker: 0 - mean infer runtime (microseconds): 101.15
DONE
LAUNCH!
worker: 0 - mean infer runtime (microseconds): 68.413
DONE
worker: 0 - mean infer runtime (microseconds): 63.623
DONE

Still a bit slower than no concurrency at all, but that is expected due to bandwidth limitations of the underlying machine.

Here’s part of the trace (shows three GPUs):

Zoomed in:

No more pthread_mutex_lock, and much better throughput and GPU utilization. The only difference between this run and the previous one is that it parallelizes over processes instead of threads.

Attachments

sampleOnnxMNIST.cpp (UPDATED VERSION 13.3 KB)
start_sample_onnx_mnist_single_process.sh (45 Bytes)
start_sample_onnx_mnist_multi_process.sh (716 Bytes)

EDIT: I looked at the traces and this time it’s not just pthread_mutex_lock but also a lot of other stuff. In the end the behavior is the same, though. I’ll try and see if I can get a cleaner repro with a larger model.

EDIT 2: Fixed the repro! It now only measures contention during inference (just enqueue).

Hi,

We may need to confirm whether this is a TensorRT problem and not a CUDA problem.
Do you observe the same problem with a purely CUDA program?

Also, it looks like sharing the same ILogger across multiple threads is causing the lock contention.
In the code, you are creating the IRuntime instance using the same global logger:

SampleUniquePtr<IRuntime> runtime{createInferRuntime(sample::gLogger.getTRTLogger())};

Could you please try using one logger per thread?
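For example, something along these lines (an illustrative sketch only):

#include <NvInfer.h>

#include <iostream>

// Minimal stand-alone logger so that each thread owns its own instance
// instead of all threads sharing sample::gLogger.
class ThreadLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
        {
            std::cout << msg << std::endl;
        }
    }
};

// Inside each worker thread:
//   ThreadLogger logger;
//   SampleUniquePtr<IRuntime> runtime{createInferRuntime(logger)};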

Thank you.

Do you observe the same problem with a purely CUDA program?

Let me try and see if I can get a repro in CUDA. Do you have a suggestion for how to create an artificial load with CUDA?

Problem persists:

LAUNCH!
worker: 3 - mean infer runtime (microseconds): 830.155
worker: 8 - mean infer runtime (microseconds): 838.224
worker: 1 - mean infer runtime (microseconds): 1183.69
worker: 6 - mean infer runtime (microseconds): 1199.75
worker: 9 - mean infer runtime (microseconds): 1248.79
worker: 5 - mean infer runtime (microseconds): 1230.04
worker: 7 - mean infer runtime (microseconds): 1229
worker: 0 - mean infer runtime (microseconds): 1225.78
worker: 2 - mean infer runtime (microseconds): 1268.99
worker: 4 - mean infer runtime (microseconds): 1261.02
DONE

Profiler shows lots of pthread_mutex_lock again.

During testing, I did notice that the issue doesn’t always manifest. After restarting the server, the issue was completely gone; then, after a couple of runs, it came back. There seems to be some randomness to it and I don’t know where it comes from.

I modified the matrixMul sample in the CUDA samples. See the code attached. The same problem seems to exist for CUDA as well.

Edit: An earlier version of this post stated that the problem wasn’t in CUDA, but I had just used too few iterations and it turns out the kernels were not overlapping. After changing that, it seems the issue is indeed in CUDA.
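The modification follows the same thread-per-device pattern as before, roughly like this (an illustrative sketch, not the attached file; the trivial busyKernel stands in for the actual matrixMul kernel):

#include <cuda_runtime.h>

#include <thread>
#include <vector>

// Trivial stand-in kernel; the attached code launches the matrixMul kernel instead.
__global__ void busyKernel(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        float v = data[idx];
        for (int i = 0; i < 1000; ++i)
        {
            v = v * 1.000001f + 0.5f;
        }
        data[idx] = v;
    }
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<std::thread> workers;
    for (int device = 0; device < deviceCount; ++device)
    {
        workers.emplace_back([device]() {
            cudaSetDevice(device);
            const int n = 1 << 20;
            float* d = nullptr;
            cudaMalloc(&d, n * sizeof(float));
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            // Enough iterations that the work on all GPUs actually overlaps.
            for (int i = 0; i < 10000; ++i)
            {
                busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
            }
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
            cudaFree(d);
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }
    return 0;
}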

Single-GPU

CUDA_VISIBLE_DEVICES=0 ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
LAUNCH!
Performance= 1445.88 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE

Runtime around 0.091 msec

Multi-threaded, multi-GPU

./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
Computing result using CUDA Kernel...
done
done
Computing result using CUDA Kernel...
Computing result using CUDA Kernel...
done
done
LAUNCH!
Performance= 413.12 GFlop/s, Time= 0.317 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 408.03 GFlop/s, Time= 0.321 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 398.93 GFlop/s, Time= 0.329 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 391.88 GFlop/s, Time= 0.334 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 393.24 GFlop/s, Time= 0.333 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 389.40 GFlop/s, Time= 0.337 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Performance= 392.55 GFlop/s, Time= 0.334 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Result = PASS
Performance= 390.90 GFlop/s, Time= 0.335 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: 
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 388.16 GFlop/s, Time= 0.338 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Performance= 386.39 GFlop/s, Time= 0.339 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE

Runtime around 0.33 msec (4x slower)

This is what the trace looks like:

Multi-process, multi-GPU

CUDA_VISIBLE_DEVICES=0 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=1 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=2 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=3 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=4 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=5 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=6 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=7 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=8 CUDA_MODULE_LOADING=LAZY ./matrixMul & \
CUDA_VISIBLE_DEVICES=9 CUDA_MODULE_LOADING=LAZY ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
done
Computing result using CUDA Kernel...
Computing result using CUDA Kernel...
done
done
Computing result using CUDA Kernel...
done
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
LAUNCH!
Performance= 1447.73 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1432.36 GFlop/s, Time= 0.092 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1450.05 GFlop/s, Time= 0.090 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1469.87 GFlop/s, Time= 0.089 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1456.77 GFlop/s, Time= 0.090 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1462.62 GFlop/s, Time= 0.090 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1426.73 GFlop/s, Time= 0.092 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
Performance= 1436.94 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Performance= 1427.04 GFlop/s, Time= 0.092 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE
DONE
Performance= 1443.78 GFlop/s, Time= 0.091 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
DONE

Runtime around 0.09 msec (same as the single-GPU run, so the fastest possible)

The trace:

Resources

Since it seems this is a CUDA issue, I created a new thread here: CUDA won't concurrently run kernels on multiple devices from within same process

Thank you.
