Description
According to the documentation, TensorRT “may use worker threads internally”. The scheduling/pinning of CPU threads to CPU cores has performance implications; I want finer control over whether these worker threads get created and where they are affinitized.
Ideally, I want 0 worker threads so that I don’t have to allocate them any cores. If there are worker threads, they have to be affinitized somewhere; if several are affinitized to the same core, they will compete for it, resulting in unpredictable inference latency or latency jitter. I don’t understand what these worker threads do (they appear to be sleeping most of the time), but in the worst case, for best performance, each worker thread should get its own core to minimize resource contention. To avoid wasting cores, I would therefore like to minimize the number of worker threads.
Does TensorRT provide a way to control the number of worker threads created? In my use case, which uses one execution context from one thread, I observe that TensorRT typically creates 2–3 worker threads automatically, but I don’t have 2–3 free cores to give them.
What do these worker threads actually do? In my use case, I run one inference at a time from the main thread. Can I pin these worker threads to a core shared by many other miscellaneous threads, so that my main thread can live in peace on an exclusive core without hurting inference latency? In other words, does making the worker threads share one core with other misc threads hurt inference performance in my use case?
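What I have in mind concretely is something like the following (a Linux-only sketch; `SHARED_CPUS` is a placeholder for whichever core I would dedicate to misc threads, and `pin_other_threads` is my own helper, not anything from TensorRT):

```python
import os
import threading

SHARED_CPUS = {0}  # assumption: core 0 is the shared "misc" core

def pin_other_threads(shared_cpus):
    """Pin every thread in this process except the caller to shared_cpus.

    The idea: after TensorRT has created its worker threads, sweep them
    all onto one shared core, leaving the main (inference) thread free to
    be pinned to an exclusive core.
    """
    me = threading.get_native_id()  # kernel TID of the calling thread
    pinned = []
    for tid in map(int, os.listdir("/proc/self/task")):
        if tid == me:
            continue  # leave the calling (main) thread untouched
        try:
            os.sched_setaffinity(tid, shared_cpus)
            pinned.append(tid)
        except OSError:
            pass  # thread may have exited between listdir and the call
    return pinned
```

The open question is whether crowding the TensorRT workers onto one busy core this way degrades inference latency, given that they seem to sleep most of the time.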
Environment
TensorRT Version: 10.1.0.27
GPU Type: L4
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4
CUDNN Version:
Operating System + Version: AlmaLinux 8
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Baremetal