Why does NCU appear to serialize execution of all kernels during kernel replay?

I have observed an interesting phenomenon. My CUDA application launches kernel a, kernel b, and kernel c on different streams so that they can run in parallel. The launch order is kernel a, kernel b, kernel c, but the actual execution order on the GPU is kernel b, kernel a, kernel c.
Then, when I use NCU with kernel replay and profile only kernel c, the execution order on the GPU becomes kernel a, kernel b, kernel c. This is contrary to my expectations. I thought NCU would serialize execution only for the kernels I specified and would minimize its impact on the application’s actual execution order.
Is this a bug, or is it designed to behave this way?

Thanks

This behavior sounds expected. There is no guarantee about the order in which kernels execute on the GPU if they are launched on independent streams, so the particular order you are observing (every time or most of the time) should be considered coincidental, not something to rely on.
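If a particular order actually matters to the application, it has to be expressed explicitly rather than inferred from what was observed. Below is a minimal sketch of one way to do that with CUDA events, assuming three hypothetical kernels (`kernelA`, `kernelB`, `kernelC`) and a hypothetical output buffer `d_out`. None of these names come from the original application, and the a, b, c order is chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for kernel a, b, and c (not the original application code).
__global__ void kernelA(int *out) { out[0] = 1; }
__global__ void kernelB(int *out) { out[1] = out[0] + 1; }  // reads kernelA's result
__global__ void kernelC(int *out) { out[2] = out[1] + 1; }  // reads kernelB's result

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 3 * sizeof(int));
    cudaMemset(d_out, 0, 3 * sizeof(int));

    // One stream per kernel, as in the scenario described above.
    cudaStream_t sA, sB, sC;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);
    cudaStreamCreate(&sC);

    // Events used only for ordering, so timing is disabled.
    cudaEvent_t aDone, bDone;
    cudaEventCreateWithFlags(&aDone, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&bDone, cudaEventDisableTiming);

    kernelA<<<1, 1, 0, sA>>>(d_out);
    cudaEventRecord(aDone, sA);         // marks the point in sA after kernelA

    cudaStreamWaitEvent(sB, aDone, 0);  // work queued on sB after this waits for kernelA
    kernelB<<<1, 1, 0, sB>>>(d_out);
    cudaEventRecord(bDone, sB);         // marks the point in sB after kernelB

    cudaStreamWaitEvent(sC, bDone, 0);  // work queued on sC after this waits for kernelB
    kernelC<<<1, 1, 0, sC>>>(d_out);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_out[3];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("%d %d %d\n", h_out[0], h_out[1], h_out[2]);

    cudaEventDestroy(aDone);
    cudaEventDestroy(bDone);
    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    cudaStreamDestroy(sC);
    cudaFree(d_out);
    return 0;
}
```

With these dependencies in place the kernels no longer overlap, so this only makes sense where the order is a real requirement. Without the `cudaStreamWaitEvent` calls, any interleaving of the three kernels is valid, with or without a profiler attached.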

That being said, while the impact on kernels that are not profiled is definitely lower than on profiled ones, there is still significant overhead when executing the application under the tool, compared to running it without one. You are correct that kernels that are not profiled are not expected to be serialized. However, since the tool has to do work at the point where each kernel is launched (e.g. to check whether it should be profiled), you may see a divergence from the plain execution. This is comparable to your CPU context switching during the kernel launch in the driver, in which case you would also observe an overhead.

If you are interested in understanding your system- or application-level performance, you should use Nsight Systems, which is less intrusive than Nsight Compute. The latter is used for understanding detailed per-workload (e.g. per-kernel) performance, at the cost of higher overhead.

Let me see if I understand correctly. As you mentioned, NCU doesn’t enforce serialization of kernels that are not being profiled. Instead, it performs some checks at each kernel launch, which incurs significant overhead and can change the execution order on the device. So this behavior is expected. For scenarios where preserving the original kernel execution timeline is crucial, the recommendation is to use Nsight Systems instead. Is that correct?

Thanks.

Yes, that is correct.

Thank you.