I have a general question about how to ensure the TensorRT builder produces accurate timings. The TensorRT Developer Guide (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation) states:
- The builder times algorithms to determine the fastest. Running the builder in parallel with other GPU work may perturb the timings, resulting in poor optimization.
I’m implementing a multithreaded C++ application using the TensorRT API. At any given point, multiple threads may be using the GPU, performing inference with different models. Every once in a while, a thread may also asynchronously receive a new model to perform inference with, for which it builds an engine using the builder API. Given the structure of the overall application, it is not trivial for such a thread to “know” what the other threads are doing at that moment or how heavily they are using the GPU.
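For concreteness, the build path on such a thread looks roughly like this. This is a minimal sketch against the TensorRT 8.x C++ API, assuming an ONNX model as input; `buildEngineFromOnnx` and the logger are illustrative names, and error handling is trimmed:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <memory>

namespace {
// Minimal logger required by the builder/runtime factory functions.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::fprintf(stderr, "[TRT] %s\n", msg);
    }
} gLogger;
} // namespace

// Called occasionally on a worker thread when a new model arrives. Other
// threads may be performing inference on the GPU at the same time, which is
// exactly the situation that can perturb the builder's kernel timings.
std::unique_ptr<nvinfer1::ICudaEngine> buildEngineFromOnnx(const char* onnxPath)
{
    using namespace nvinfer1;

    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(gLogger));
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, gLogger));
    if (!parser->parseFromFile(onnxPath, static_cast<int>(ILogger::Severity::kWARNING)))
        return nullptr;

    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());

    // This is the step that times candidate kernels to pick the fastest ones,
    // and hence the step I am worried about running concurrently with other
    // threads' inference work.
    auto serialized = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    if (!serialized)
        return nullptr;

    // The runtime must outlive the engines it deserializes, so keep one
    // process-wide instance for the purposes of this sketch.
    static auto runtime = std::unique_ptr<IRuntime>(createInferRuntime(gLogger));
    return std::unique_ptr<ICudaEngine>(
        runtime->deserializeCudaEngine(serialized->data(), serialized->size()));
}
```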
My question is: in practice, how problematic can this concurrent GPU use by other threads within the same application be for the accuracy of the builder’s timings, and therefore for how well it optimizes? And is there a clean way to ensure reasonably accurate timings even in such a case?
For example, would it be sufficient to run the builder on the legacy default CUDA stream, so that its kernel executions synchronize with those of all other threads? Would that effectively “block out” other threads’ kernels during the builder’s timing runs, and ensure reasonable results even under concurrent, variable GPU usage by other threads? (Note: in this application, threads generally use cudaStreamPerThread and make little or no use of cudaStreamNonBlocking.)
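For concreteness, here is roughly what I have in mind, assuming `IBuilderConfig::setProfileStream` (available since 8.0, as far as I can tell) is the right knob for choosing the stream the builder times on:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Point the builder's timing at the legacy default stream. Kernels launched
// on cudaStreamLegacy implicitly synchronize with all blocking streams in
// the process, including each thread's cudaStreamPerThread, so in principle
// the builder's measurement kernels would not overlap with other threads'
// inference kernels. Whether this is sufficient in practice is my question.
void useLegacyStreamForBuilderTiming(nvinfer1::IBuilderConfig& config)
{
    config.setProfileStream(cudaStreamLegacy);
}
```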
TensorRT Version: 8.0+
GPU Type: Various
Nvidia Driver Version: Various
CUDA Version: 11.1+
CUDNN Version: 8.2+