TensorRT Builder timing cache - preventing inaccurate timings due to concurrent GPU use

dwu · October 10, 2021, 3:16pm

Description

I have a general question about how to ensure that the TensorRT builder produces accurate timings. The NVIDIA Deep Learning TensorRT Developer Guide states:

The builder times algorithms to determine the fastest. Running the builder in parallel with other GPU work may perturb the timings, resulting in poor optimization.

I’m implementing a multithreaded C++ application using the TensorRT API. At any given point, multiple threads may be using the GPU to perform inference with different models. Every once in a while, a thread may also asynchronously receive a new model to perform inference with, which it will construct using the builder API. Given the structure of the overall application, it is not trivial for such a thread to “know” what other threads are doing at the time and the degree to which they are using the GPU.

My question is - how problematic can this concurrent GPU use by other threads within the same application actually be in practice, in terms of obtaining accurate timings for the builder to optimize well? And is there a nice way to ensure that reasonably accurate timings are obtained even in such a case?

For example, would it be sufficient to run the builder using the legacy CUDA stream, so that it synchronizes all of its kernel executions with those of all other threads? Would this successfully ensure that the builder’s kernel timings “block out” those of other threads and ensure reasonable results even under concurrent and variable usage of the GPU by other threads? (note: for this application, generally threads will be using cudaStreamPerThread, and will not be using cudaStreamNonBlocking much or at all)

Environment

TensorRT Version: 8.0+
GPU Type: Various
Nvidia Driver Version: Various
CUDA Version: 11.1+
CUDNN Version: 8.2+

NVES · October 11, 2021, 2:36am

Hi,
The links below might be useful to you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading and multi-streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.

Thanks!

dwu · October 13, 2021, 12:50am

Thanks, but I cannot find an answer to our question in the links you’ve provided. The issue I raised is not one of thread safety, and I am comfortable for now with how stream priority and synchronization work. I’m also not interested in adding complexity to our system with other tools like DeepStream or Triton. CUDA + TensorRT is more than capable of handling our use case.

Let me reiterate the specific technical question: If we run the TensorRT builder API using the legacy CUDA stream or otherwise a stream configured such that the documentation indicates it is guaranteed to synchronize with any other calls we expect to run on the GPU (as opposed to non-blocking work on the GPU that to my understanding would run genuinely concurrently), will that be sufficient for TensorRT to obtain reasonable timings?

If there is anyone who has the knowledge or experience to answer this question (or even just an informed opinion that would help give clarity to this issue), it would be a great help.

spolisetty · October 16, 2021, 1:43pm

Hi,

Sorry for the delayed response.
This thread is not really about the builder timing cache at all. However, I think running the builder in the legacy (default) stream sounds reasonable, since the synchronization there will block other concurrent streams in the app. We have never tried that ourselves, though. You can give it a try and report any issues you run into.

The general guideline from TRT is to avoid running the builder in parallel with other GPU workloads, because it is hard to tell whether the GPU (or memory system) is busy while the builder is profiling specific tactics.

Thank you.
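
One application-level way to follow the guideline above, without each thread having to "know" what the others are doing, is to gate GPU use behind a readers-writer lock. The following is only a minimal sketch under stated assumptions, not something confirmed in this thread: GpuGate, withInference, withBuilder and gateKeepsBuilderIsolated are hypothetical names, the callbacks stand in for real TensorRT/CUDA calls, and each inference callback is assumed to synchronize its own stream before returning so that no GPU work is still in flight when the lock is released. It needs C++17 for std::shared_mutex.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical application-level gate; not part of TensorRT or CUDA.
// Inference threads hold the lock in shared mode while they have GPU work in
// flight; a thread that needs to build an engine holds it exclusively.
class GpuGate {
public:
    // Run inference-side work. Many callers may be inside at the same time.
    // Assumption: fn synchronizes its stream before it returns.
    template <typename Fn>
    void withInference(Fn&& fn) {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        fn();
    }

    // Run the builder. Waits until no inference caller is inside and keeps
    // new ones out until fn returns.
    template <typename Fn>
    void withBuilder(Fn&& fn) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        fn();
    }

private:
    std::shared_mutex mutex_;
};

// Self-check with stand-in workloads (no GPU involved): returns true if no
// inference callback was ever active while a build callback was running.
bool gateKeepsBuilderIsolated(int inferenceThreads, int iterations) {
    assert(inferenceThreads > 0 && iterations > 0);
    GpuGate gate;
    std::atomic<int> activeInference{0};
    std::atomic<bool> overlap{false};

    std::vector<std::thread> workers;
    for (int t = 0; t < inferenceThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                gate.withInference([&] {
                    ++activeInference;          // stand-in for enqueue + sync
                    std::this_thread::yield();
                    --activeInference;
                });
            }
        });
    }
    std::thread builder([&] {
        for (int i = 0; i < iterations; ++i) {
            gate.withBuilder([&] {              // stand-in for the engine build
                if (activeInference.load() != 0) overlap = true;
            });
        }
    });
    for (auto& w : workers) w.join();
    builder.join();
    return !overlap.load();
}
```

The trade-off is that inference stalls for the duration of a build. Note also that std::shared_mutex does not specify a fairness policy, so under sustained inference load the builder thread may wait a long time before it gets the exclusive lock. Unlike the legacy-stream idea, this does not rely on stream synchronization semantics, but it only covers GPU work issued by threads that go through the gate.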

