[Nsight Systems] Unexpected cudaEventSynchronize during TRT enqueue when co-loading models, and discrepancies between CPU/GPU timelines

Environment:

  • GPU: RTX 5060 Ti

  • CUDA Version: [Please insert your CUDA version, e.g., 12.1]

  • TensorRT Version: [Please insert your TRT version, e.g., 8.6]

  • OS: [Please insert your OS, e.g., Ubuntu 22.04]

Description:

Hi community,

I am profiling a TensorRT inference pipeline using Nsight Systems (nsys) and have encountered two confusing behaviors regarding the enqueue (or enqueueV3) execution time and implicit synchronizations.

My pipeline consists of two models executed sequentially: a ViTPrefill model followed by a decode model.

Question 1: Inconsistent enqueue behavior

When I load and execute the ViTPrefill model alone, the enqueue process is purely asynchronous. There is no cudaEventSynchronize at the tail end of the API call.

However, when I load both the ViTPrefill and decode models into memory, and execute them sequentially (first ViTPrefill, then decode), the nsys API trace for ViTPrefill changes dramatically. At the tail end of the ViTPrefill enqueue call, multiple cudaEventSynchronize events appear.

  • Observation: The number of these synchronization events roughly matches the number of output tensors of the ViTPrefill model.

  • My question: Shouldn’t the enqueue behavior of the exact same TensorRT engine be completely deterministic and identical? Why does simply loading a second model into memory cause the first model’s enqueue to implicitly synchronize at the end? Could this be related to VRAM pressure triggering Unified Memory (UVM) page faults, or something related to how output tensor shapes are queried?
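For context, here is a minimal sketch of the async enqueue pattern being described (hedged: this assumes the standard TensorRT 8.5+ C++ API with `enqueueV3`; the function and variable names are placeholders, and the snippet is not compilable stand-alone since it needs the TensorRT and CUDA toolkits):

```cpp
// Sketch of a fully asynchronous enqueue (TensorRT C++ API).
#include <NvInfer.h>
#include <cuda_runtime.h>

void runPrefill(nvinfer1::IExecutionContext* context, cudaStream_t stream) {
    // Device buffer addresses would be bound by tensor name beforehand, e.g.:
    // context->setTensorAddress("input",  dInput);
    // context->setTensorAddress("output", dOutput);

    // In the single-model case described above, this call returns immediately:
    // the kernels are merely queued on `stream`, and the nsys API trace shows
    // no cudaEventSynchronize at the tail of the call.
    context->enqueueV3(stream);

    // Any blocking should come from an explicit sync issued by the application:
    cudaStreamSynchronize(stream);
}
```

If extra `cudaEventSynchronize` calls show up inside the enqueue boundary without the application issuing them, they are coming from inside the TensorRT runtime, which is what Question 1 is asking about.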

Question 2: Discrepancies between “Threads” and “CUDA HW” timelines in nsys

For the scenario where both models are loaded, I noticed a significant discrepancy in how the enqueue duration is represented in different nsys timelines:

  • In the Threads (CPU) timeline, the enqueue block is much longer and visually encloses/contains the cudaEventSynchronize events at its tail.

  • In the CUDA HW (GPU) timeline, the actual execution duration is different. There are no cudaEventSynchronize events visible at the end of the compute sequence, only kernels and memory operations.

  • My question: What exactly do these two different representations mean in this context? Why does the Threads enqueue timeline include the sync overhead inside the enqueue boundary, while the hardware timeline does not? How should I properly interpret the “true” inference latency of the enqueue step here?
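One common way to pin down the "true" device-side latency, independent of how long the CPU spends inside the enqueue call, is to bracket the enqueue with CUDA events recorded on the same stream. This is a hedged sketch (it assumes `enqueueV3`, a valid `context` and `stream`, and the CUDA runtime API; it needs a GPU to run):

```cpp
// Sketch: measure GPU-side execution time of an enqueue with CUDA events.
#include <NvInfer.h>
#include <cuda_runtime.h>

float timedEnqueue(nvinfer1::IExecutionContext* context, cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);   // marks where the work begins on the GPU
    context->enqueueV3(stream);       // CPU returns as soon as work is queued
    cudaEventRecord(stop, stream);    // marks where the last kernel ends

    cudaEventSynchronize(stop);       // block the CPU until the GPU reaches `stop`
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The elapsed time between the two events corresponds to what the CUDA HW row shows, while the duration of the enqueue block on the Threads row is just how long the CPU spent inside the API call (including any internal syncs).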

Any insights into TensorRT’s internal synchronization logic or how to properly interpret these Nsight Systems traces would be greatly appreciated!

@liuyis can you respond to this?

Hi @KarlDe ,

Question 1 sounds more related to TensorRT’s implementation, so we are unfortunately not the best placed to help. I suggest asking in AI & Data Science - NVIDIA Developer Forums or CUDA - NVIDIA Developer Forums. Here we are more focused on questions and issues related to Nsight Systems usage itself.

For Question 2 - the CUDA API row under the Threads (CPU) timeline tracks the CUDA API calls the application made on the CPU side. The CUDA HW (GPU) row shows the CUDA kernels, memcpys, memsets, etc. that ran on the GPU, usually triggered by some CPU-side CUDA API call. To help correlate between them, CPU-side NVTX ranges that cover certain CUDA API calls are projected onto the GPU-side CUDA workloads. cudaEventSynchronize is a CPU-side operation, so there won’t be any corresponding activity on the CUDA HW (GPU) row. See also: User Guide — nsight-systems 2025.1 documentation
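To illustrate the projection mechanism mentioned above: wrapping the enqueue call in a CPU-side NVTX range makes it easy to find the corresponding kernels on the GPU row. A hedged sketch, assuming the NVTX v3 C API shipped with the CUDA toolkit (the range label and surrounding variables are placeholders):

```cpp
#include <nvtx3/nvToolsExt.h>

nvtxRangePushA("ViTPrefill enqueue");  // range appears on the CPU thread row
context->enqueueV3(stream);            // kernels appear on the CUDA HW row,
nvtxRangePop();                        // grouped under the projected range
```

With `--trace=cuda,nvtx` enabled in nsys, the range shows up on the Threads row around the API calls, and a projected copy appears over the GPU workloads those calls launched, making the CPU-vs-GPU duration difference directly visible.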