[Nsight Systems] Unexpected cudaEventSynchronize during TRT enqueue when co-loading models, and discrepancies between CPU/GPU timelines

Environment:

  • GPU: RTX 5060 Ti

  • CUDA Version: [Please insert your CUDA version, e.g., 12.1]

  • TensorRT Version: [Please insert your TRT version, e.g., 8.6]

  • OS: [Please insert your OS, e.g., Ubuntu 22.04]

Description:

Hi community,

I am profiling a TensorRT inference pipeline using Nsight Systems (nsys) and have encountered two confusing behaviors regarding the enqueue (or enqueueV3) execution time and implicit synchronizations.

My pipeline consists of two models executed sequentially: a ViTPrefill model followed by a decode model.

Question 1: Inconsistent enqueue behavior

When I load and execute the ViTPrefill model alone, the enqueue process is purely asynchronous. There is no cudaEventSynchronize at the tail end of the API call.

However, when I load both the ViTPrefill and decode models into memory, and execute them sequentially (first ViTPrefill, then decode), the nsys API trace for ViTPrefill changes dramatically. At the tail end of the ViTPrefill enqueue call, multiple cudaEventSynchronize events appear.

  • Observation: The number of these synchronization events roughly matches the number of output tensors of the ViTPrefill model.

  • My question: Shouldn’t the enqueue behavior of the exact same TensorRT engine be completely deterministic and identical? Why does simply loading a second model into memory cause the first model’s enqueue to implicitly synchronize at the end? Could this be related to VRAM pressure triggering Unified Memory (UVM) page faults, or something related to how output tensor shapes are queried?
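For context, here is a minimal sketch of the async enqueue pattern being described (hedged: this assumes the standard TensorRT 8.5+ C++ API with `enqueueV3`; the function and variable names are placeholders, and the snippet is not compilable stand-alone since it needs the TensorRT and CUDA toolkits):

```cpp
// Sketch of a fully asynchronous enqueue (TensorRT C++ API).
#include <NvInfer.h>
#include <cuda_runtime.h>

void runPrefill(nvinfer1::IExecutionContext* context, cudaStream_t stream) {
    // Device buffer addresses would be bound by tensor name beforehand, e.g.:
    // context->setTensorAddress("input",  dInput);
    // context->setTensorAddress("output", dOutput);

    // In the single-model case described above, this call returns immediately:
    // the kernels are merely queued on `stream`, and the nsys API trace shows
    // no cudaEventSynchronize at the tail of the call.
    context->enqueueV3(stream);

    // Any blocking should come from an explicit sync issued by the application:
    cudaStreamSynchronize(stream);
}
```

If extra `cudaEventSynchronize` calls show up inside the enqueue boundary without the application issuing them, they are coming from inside the TensorRT runtime, which is what Question 1 is asking about.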

Question 2: Discrepancies between “Threads” and “CUDA HW” timelines in nsys

For the scenario where both models are loaded, I noticed a significant discrepancy in how the enqueue duration is represented in different nsys timelines:

  • In the Threads (CPU) timeline, the enqueue block is much longer and visually encloses/contains the cudaEventSynchronize events at its tail.

  • In the CUDA HW (GPU) timeline, the actual execution duration is different. There are no cudaEventSynchronize events visible at the end of the compute sequence, only kernels and memory operations.

  • My question: What exactly do these two different representations mean in this context? Why does the Threads enqueue timeline include the sync overhead inside the enqueue boundary, while the hardware timeline does not? How should I properly interpret the “true” inference latency of the enqueue step here?
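One common way to pin down the "true" device-side latency, independent of how long the CPU spends inside the enqueue call, is to bracket the enqueue with CUDA events recorded on the same stream. This is a hedged sketch (it assumes `enqueueV3`, a valid `context` and `stream`, and the CUDA runtime API; it needs a GPU to run):

```cpp
// Sketch: measure GPU-side execution time of an enqueue with CUDA events.
#include <NvInfer.h>
#include <cuda_runtime.h>

float timedEnqueue(nvinfer1::IExecutionContext* context, cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);   // marks where the work begins on the GPU
    context->enqueueV3(stream);       // CPU returns as soon as work is queued
    cudaEventRecord(stop, stream);    // marks where the last kernel ends

    cudaEventSynchronize(stop);       // block the CPU until the GPU reaches `stop`
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The elapsed time between the two events corresponds to what the CUDA HW row shows, while the duration of the enqueue block on the Threads row is just how long the CPU spent inside the API call (including any internal syncs).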

Any insights into TensorRT’s internal synchronization logic or how to properly interpret these Nsight Systems traces would be greatly appreciated!

@liuyis can you respond to this?

Hi @KarlDe ,

Question 1 sounds more related to TensorRT’s implementation, so we are unfortunately not the best placed to help. I suggest asking in AI & Data Science - NVIDIA Developer Forums or CUDA - NVIDIA Developer Forums. Here we are more focused on questions and issues related to Nsight Systems usage itself.

For Question 2 - the CUDA API row under the Threads (CPU) timeline tracks the CUDA API calls the application made on the CPU side. The CUDA HW (GPU) row shows the CUDA kernels, memcpys, memsets, etc. that ran on the GPU, usually triggered by some CPU-side CUDA API call. To help correlate between them, CPU-side NVTX ranges that cover certain CUDA API calls are projected onto the GPU-side CUDA workloads. cudaEventSynchronize is a CPU-side operation, so there won’t be any corresponding activity on the CUDA HW (GPU) row. See also: User Guide — nsight-systems 2025.1 documentation
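To illustrate the projection mechanism mentioned above: wrapping the enqueue call in a CPU-side NVTX range makes it easy to find the corresponding kernels on the GPU row. A hedged sketch, assuming the NVTX v3 C API shipped with the CUDA toolkit (the range label and surrounding variables are placeholders):

```cpp
#include <nvtx3/nvToolsExt.h>

nvtxRangePushA("ViTPrefill enqueue");  // range appears on the CPU thread row
context->enqueueV3(stream);            // kernels appear on the CUDA HW row,
nvtxRangePop();                        // grouped under the projected range
```

With `--trace=cuda,nvtx` enabled in nsys, the range shows up on the Threads row around the API calls, and a projected copy appears over the GPU workloads those calls launched, making the CPU-vs-GPU duration difference directly visible.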