Why does NCU perform global serialized execution for all current kernels during kernel replay?

FlyK · November 7, 2023, 10:15am

I have observed an interesting phenomenon. My CUDA application launches kernel a, kernel b, and kernel c on different streams so that they can run in parallel. The launch order is kernel a, kernel b, kernel c, but the actual execution order on the GPU is kernel b, kernel a, kernel c.
When I then use NCU with kernel replay and profile only kernel c, the execution order on the GPU becomes kernel a, kernel b, kernel c. This is contrary to my expectations: I thought NCU would serialize only the kernels I specified and minimize its impact on the execution order of the rest of the application.
Is this a bug, or is it designed to behave this way?

Thanks

felix_dt · November 7, 2023, 11:22am

This behavior sounds expected. There should be no guarantee in which order the kernels execute on the GPU, if they are launched on independent streams, so the fact that you are observing a particular order (every or most of the time) should be considered coincidental, not expected.
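To make the point about ordering concrete, here is a minimal sketch, not taken from the original application. The kernel names `kernel_a`, `kernel_b`, `kernel_c` and the helpers `take_ticket` and `run_once` are hypothetical. Each kernel takes a "ticket" from a shared counter when it starts, so the host can read back the order in which the kernels began executing. Without explicit dependencies, no particular ticket order is guaranteed. If an order is actually required, it has to be expressed, for example with `cudaEventRecord` and `cudaStreamWaitEvent`, as the `enforce_order` branch does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the first thread of the grid takes the next ticket.
__device__ void take_ticket(int *counter, int *ticket) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *ticket = atomicAdd(counter, 1);
    }
}

// Three distinct kernels so a profiler can filter on one name.
__global__ void kernel_a(int *counter, int *ticket) { take_ticket(counter, ticket); }
__global__ void kernel_b(int *counter, int *ticket) { take_ticket(counter, ticket); }
__global__ void kernel_c(int *counter, int *ticket) { take_ticket(counter, ticket); }

// Launches a, b, c on three streams and copies their start tickets to the host.
// With enforce_order == true, events chain the streams so that a finishes
// before b starts and b finishes before c starts.
bool run_once(bool enforce_order, int *tickets_host) {
    int *counter = nullptr, *tickets = nullptr;
    if (cudaMalloc(&counter, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&tickets, 3 * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(counter, 0, sizeof(int));
    cudaMemset(tickets, 0xff, 3 * sizeof(int));  // every ticket starts as -1

    cudaStream_t s[3];
    cudaEvent_t done[2];
    for (int i = 0; i < 3; ++i) cudaStreamCreate(&s[i]);
    for (int i = 0; i < 2; ++i) cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);

    kernel_a<<<1, 32, 0, s[0]>>>(counter, tickets + 0);
    if (enforce_order) {
        cudaEventRecord(done[0], s[0]);         // marks completion of kernel_a
        cudaStreamWaitEvent(s[1], done[0], 0);  // kernel_b waits for it
    }
    kernel_b<<<1, 32, 0, s[1]>>>(counter, tickets + 1);
    if (enforce_order) {
        cudaEventRecord(done[1], s[1]);         // marks completion of kernel_b
        cudaStreamWaitEvent(s[2], done[1], 0);  // kernel_c waits for it
    }
    kernel_c<<<1, 32, 0, s[2]>>>(counter, tickets + 2);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(tickets_host, tickets, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    }

    for (int i = 0; i < 2; ++i) cudaEventDestroy(done[i]);
    for (int i = 0; i < 3; ++i) cudaStreamDestroy(s[i]);
    cudaFree(tickets);
    cudaFree(counter);
    return err == cudaSuccess;
}

int main() {
    int t[3];
    // Independent streams: the printed order may differ between runs and tools.
    if (run_once(false, t)) printf("independent: a=%d b=%d c=%d\n", t[0], t[1], t[2]);
    // Explicit dependencies: the order is part of the program.
    if (run_once(true, t)) printf("chained:     a=%d b=%d c=%d\n", t[0], t[1], t[2]);
    return 0;
}
```

With kernels this small, the independent case will often happen to follow launch order. That is exactly the kind of coincidental ordering that should not be relied on, whether the application runs on its own or under a profiler.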

That being said, while the impact on kernels that are not profiled is definitely lower than on profiled ones, there is still significant overhead when executing the application under the tool compared to running it without one. You are correct that kernels that are not profiled are not expected to be serialized. However, the tool has to do work at the point where each kernel is launched (e.g. to check whether it should be profiled), so you may see a divergence from the plain execution. This is comparable to your CPU context switching during the kernel launch in the driver, in which case you would also observe overhead.

If you are interested in understanding your system- or application-level performance, you should use Nsight Systems, which is less intrusive than Nsight Compute. The latter is used for understanding detailed per-workload (e.g. per-kernel) performance, at the cost of higher overhead.

FlyK · November 7, 2023, 11:36am

Let me see if I understand correctly. As you mentioned, NCU doesn’t enforce serialization of all current kernels. Instead, it performs some checks before each kernel launch, which incur significant overhead and can change the execution order on the device. So this behavior is expected. For scenarios where preserving the original kernel execution timeline is crucial, it is recommended to use Nsight Systems instead. Is that correct?

Thanks.

felix_dt · November 16, 2023, 4:58pm

Yes, that is correct.

FlyK · November 21, 2023, 12:16pm

Thank you.

