Environment:

- GPU: RTX 5060 Ti
- CUDA Version: [Please insert your CUDA version, e.g., 12.1]
- TensorRT Version: [Please insert your TRT version, e.g., 8.6]
- OS: [Please insert your OS, e.g., Ubuntu 22.04]
Description:

Hi community,
I am profiling a TensorRT inference pipeline using Nsight Systems (nsys) and have encountered two confusing behaviors regarding the enqueue (or enqueueV3) execution time and implicit synchronizations.
My pipeline consists of two models executed sequentially: a ViTPrefill model followed by a decode model.
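For context, here is a minimal sketch of how I drive the two engines with the TensorRT Python API. The engine file paths are placeholders, and buffer allocation/binding is elided; only the sequential `execute_async_v3` calls on a single stream are the point:

```python
# Sketch of my two-model pipeline; engine paths are placeholders.
import tensorrt as trt
from cuda import cudart  # cuda-python bindings

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

def load_engine(path):
    with open(path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

vit_engine = load_engine("vit_prefill.engine")    # placeholder path
decode_engine = load_engine("decode.engine")      # placeholder path
vit_ctx = vit_engine.create_execution_context()
decode_ctx = decode_engine.create_execution_context()

_, stream = cudart.cudaStreamCreate()

# Device buffers are assumed to be allocated elsewhere and bound via
# ctx.set_tensor_address(name, ptr) for every I/O tensor of both engines.

# Both enqueues go onto the same stream. My expectation is that these
# calls are fully asynchronous, with blocking only at the explicit sync.
vit_ctx.execute_async_v3(stream)
decode_ctx.execute_async_v3(stream)
cudart.cudaStreamSynchronize(stream)
```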
Question 1: Inconsistent enqueue behavior
When I load and execute the ViTPrefill model alone, the enqueue process is purely asynchronous. There is no cudaEventSynchronize at the tail end of the API call.
However, when I load both the ViTPrefill and decode models into memory, and execute them sequentially (first ViTPrefill, then decode), the nsys API trace for ViTPrefill changes dramatically. At the tail end of the ViTPrefill enqueue call, multiple cudaEventSynchronize events appear.
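For reference, this is roughly how I capture the trace; the application and report names are placeholders:

```shell
# Trace CUDA runtime/driver API calls, NVTX ranges, and OS runtime calls.
# "./my_pipeline" and the report name are placeholders.
nsys profile \
    --trace=cuda,nvtx,osrt \
    --output=vit_decode_report \
    ./my_pipeline
```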
- Observation: The number of these synchronization events roughly matches the number of output tensors of the `ViTPrefill` model.
- My question: Shouldn’t the `enqueue` behavior of the exact same TensorRT engine be completely deterministic and identical? Why does simply loading a second model into memory cause the first model’s `enqueue` to implicitly synchronize at the end? Could this be related to VRAM pressure triggering Unified Memory (UVM) page faults, or to how output tensor shapes are queried?
Question 2: Discrepancies between “Threads” and “CUDA HW” timelines in nsys
For the scenario where both models are loaded, I noticed a significant discrepancy in how the enqueue duration is represented in different nsys timelines:
- In the Threads (CPU) timeline, the `enqueue` block is much longer and visually encloses the `cudaEventSynchronize` events at its tail.
- In the CUDA HW (GPU) timeline, the actual execution duration is different: there are no `cudaEventSynchronize` events visible at the end of the compute sequence, only kernels and memory operations.
- My question: What exactly do these two different representations mean in this context? Why does the Threads timeline include the sync overhead inside the `enqueue` boundary, while the hardware timeline does not? How should I properly interpret the “true” inference latency of the `enqueue` step here?
Any insights into TensorRT’s internal synchronization logic or how to properly interpret these Nsight Systems traces would be greatly appreciated!


