TensorRT Builder timing cache - preventing inaccurate timings due to concurrent GPU use

Description

I have a general question about how to ensure the TensorRT builder produces accurate timings. The TensorRT Developer Guide (NVIDIA Deep Learning TensorRT Documentation) states:

  • The builder times algorithms to determine the fastest. Running the builder in parallel with other GPU work may perturb the timings, resulting in poor optimization.

I’m implementing a multithreaded C++ application using the TensorRT API. At any given point, multiple threads may be using the GPU to perform inference with different models. Every once in a while, a thread may also asynchronously receive a new model to perform inference with, which it will construct using the builder API. Given the structure of the overall application, it is not trivial for such a thread to “know” what other threads are doing at that time or how heavily they are using the GPU.

My question is: how problematic can this concurrent GPU use by other threads within the same application actually be in practice, in terms of obtaining timings accurate enough for the builder to optimize well? And is there a clean way to ensure that reasonably accurate timings are obtained even in such a case?

For example, would it be sufficient to run the builder using the legacy CUDA stream, so that it synchronizes all of its kernel executions with those of all other threads? Would this successfully ensure that the builder’s kernel timings “block out” those of other threads and ensure reasonable results even under concurrent and variable usage of the GPU by other threads? (note: for this application, generally threads will be using cudaStreamPerThread, and will not be using cudaStreamNonBlocking much or at all)

Environment

TensorRT Version: 8.0+
GPU Type: Various
Nvidia Driver Version: Various
CUDA Version: 11.1+
CUDNN Version: 8.2+

Hi,
The links below might be useful to you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multithreading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.

Thanks!

Thanks, but I cannot find an answer to our question in the links you’ve provided. The issue I raised is not one of thread safety, and I am comfortable for now with how stream priority and synchronization work. I’m also not interested in adding complexity to our system with other tools like DeepStream or Triton. CUDA + TensorRT is more than capable of handling our use case.

Let me reiterate the specific technical question: If we run the TensorRT builder API using the legacy CUDA stream or otherwise a stream configured such that the documentation indicates it is guaranteed to synchronize with any other calls we expect to run on the GPU (as opposed to non-blocking work on the GPU that to my understanding would run genuinely concurrently), will that be sufficient for TensorRT to obtain reasonable timings?

If there is anyone who has the knowledge or experience to answer this question (or even just an informed opinion that would help give clarity to this issue), it would be a great help.

Hi,

Sorry for the delayed response.
This thread is not really about the builder timing cache. However, I think running the builder in the legacy (default) stream sounds reasonable, since the synchronization there will block other concurrent streams in the app. We have never tried that, though. Please give it a try and report any issues you find.

The general guideline from TRT is to avoid running the builder in parallel with other GPU workloads, because it is hard to tell whether the GPU (or memory system) is busy while the builder is profiling specific tactics.
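
One way to follow this guideline when threads cannot easily see what the others are doing is to gate GPU access inside the application, so that a build only starts once in-flight inference has drained. The sketch below is a minimal illustration using only the C++17 standard library. `GpuGate`, `runInference`, `runBuild`, and `countOverlaps` are hypothetical names, not TensorRT or CUDA APIs, and the lambdas stand in for real inference and builder calls. It assumes each inference callback synchronizes its own CUDA stream before returning, so that no GPU work outlives the shared lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not part of TensorRT or CUDA): a reader/writer gate.
// Inference calls share the gate; a build holds it exclusively.
class GpuGate {
public:
    // Many inference calls may run at the same time.
    // Assumption: `work` waits for its own GPU work to finish before returning.
    template <typename F>
    void runInference(F&& work) {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        std::forward<F>(work)();
    }

    // A build waits until no inference call holds the gate, and blocks new
    // inference calls until it returns.
    template <typename F>
    void runBuild(F&& build) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        std::forward<F>(build)();
    }

private:
    std::shared_mutex mutex_;
};

// Simulation with placeholder work: returns how many times a "build"
// observed an inference call in flight. With the gate in place this is 0.
int countOverlaps(int inferenceThreads, int iterations) {
    assert(inferenceThreads > 0 && iterations > 0);
    GpuGate gate;
    std::atomic<int> activeInference{0};
    std::atomic<int> overlaps{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < inferenceThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                gate.runInference([&] {
                    ++activeInference;          // stand-in for enqueue + sync
                    std::this_thread::yield();
                    --activeInference;
                });
            }
        });
    }

    std::thread builder([&] {
        for (int i = 0; i < iterations; ++i) {
            gate.runBuild([&] {                 // stand-in for the builder call
                if (activeInference.load() != 0) {
                    ++overlaps;
                }
            });
        }
    });

    for (auto& w : workers) {
        w.join();
    }
    builder.join();
    return overlaps.load();
}
```

In a real application, the callback passed to `runBuild` would wrap the builder call, and the callbacks passed to `runInference` would wrap the enqueue plus stream synchronization. The trade-off is that inference stalls for the duration of a build, and `std::shared_mutex` makes no fairness guarantee, so a build may wait a while under steady inference load.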

Thank you.
