Is Nsight Systems accurate in measuring kernel execution time?

psh2018314072 · April 1, 2024, 5:24pm

Recently, I profiled the training process of an LLM with Nsight Systems.
I tried to find the total execution time of all communication kernels that ran during training using the “stats” command.
Since the communication kernels never overlapped with one another, I calculated the total by simply adding up the total execution time of each communication kernel.
However, the total execution time of the communication kernels reported by Nsight Systems was much larger than the total training time measured with the time function in the Python training script.
My questions are as follows.

Does Nsight Systems really provide the exact execution time of a kernel?
The cuda_gpu_kern_sum report provided by the stats command seems to assume that all kernels run serially when it computes the execution time % for each kernel. Is this the right approach?
Thank you for reading.

hwilper · April 2, 2024, 11:47am

The kernel execution time is based on the start and stop timestamps of each kernel from the underlying CUDA driver. It is as exact as can be determined.

I am wondering if you had compute kernels overlapping with your communication kernels, as both would be part of cuda_gpu_kern_sum. That script is intended to give the user a rough statistical sense of how busy the GPU is.

What is the problem that you are trying to solve? Are you just trying to determine if there are gaps where the GPU is idle?

psh2018314072 · April 4, 2024, 12:00am

What I want is to quantify exactly how large the communication overhead is in LLM training.
As you suggested, we tried to account for overlap between kernels as much as possible. We couldn’t find a way to measure the overlap numerically, but when we looked at the Nsys UI, we didn’t see any communication kernels overlapping each other.
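
One way to put a number on overlap, offered only as a minimal sketch: compare the plain sum of kernel durations with the length of the union of their time intervals. The difference is the time that was counted more than once. The (start, end) pairs below are made-up values in seconds, not data from this report, and the function names are hypothetical; real timestamps would have to come from an exported report.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[float, float]  # (start, end), hypothetical values in seconds


def summed_duration(intervals: Iterable[Interval]) -> float:
    """Plain sum of per-kernel durations (overlapping time is counted twice)."""
    return sum(end - start for start, end in intervals)


def busy_time(intervals: Iterable[Interval]) -> float:
    """Length of the union of the intervals: time with at least one kernel running."""
    total = 0.0
    cur_start: Optional[float] = None
    cur_end: Optional[float] = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the current merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Made-up example: the first two kernels overlap by one second.
kernels = [(0.0, 4.0), (3.0, 6.0), (8.0, 9.0)]
total = summed_duration(kernels)
busy = busy_time(kernels)
print(f"sum of durations: {total} s, busy time: {busy} s, overlap: {total - busy} s")
```

If the sum and the busy time are equal, the kernels in the list did not overlap at all, and adding their durations is safe.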

hwilper · April 4, 2024, 2:52pm

There is a new multi-report recipe to determine overlap between communication and compute.

@jyi can you help with this?

psh2018314072 · April 4, 2024, 2:59pm

I checked with the recipe command right away, and the three kernels in the figure below do not overlap. I therefore decided that I could simply add up the execution times of these three kernels, but the resulting total (51s) exceeds the runtime of the actual application (44.2s), which is what prompted my question.

jyi · April 4, 2024, 3:51pm

Hello, would you be able to share the report file (.nsys-rep) with us?

psh2018314072 · April 4, 2024, 3:59pm

Absolutely. I have 4 reports from a 4-GPU environment, totaling over 100MB. How can I share them with you?

jyi · April 4, 2024, 4:37pm

You could create a tarball with the report files and attach it here.

psh2018314072 · April 4, 2024, 4:50pm

The reports are too big: even two files compressed together exceed 100MB, so I’ll upload only one. Since the training method is DDP, the results in all the report files are similar.
report_1.zip (54.5 MB)

jyi · April 4, 2024, 5:43pm

Thank you for sharing. After running the ‘nccl_gpu_overlap_trace’ recipe, I see that the ‘ncclDevKernel_AllGather_RING_LL’ and ‘ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL’ kernels do overlap with other communication kernels:

This is an example of an AllGather kernel that overlaps with a ReduceScatter kernel:

psh2018314072 · April 5, 2024, 1:54am

Thank you, but the communication overlap across all kernels is less than 1%. Even taking this into account, isn’t the total execution time of those 3 communication kernels still longer than the actual runtime?

jyi · April 5, 2024, 5:33pm

I don’t think we are referring to the same report file. The one you’ve provided has a total duration of 166s, which is longer than the sum of the communication kernels, even without considering the overlaps.

Could you attach the report file you initially used?

psh2018314072 · April 5, 2024, 5:44pm

We are looking at the same file. The runtime (44.2s) I mentioned is not the Duration (166s) shown in Nsys, but the value obtained with the time() function in the actual training Python script.
I understand that the actual runtime and the Nsys runtime are in different time domains because of profiling overhead.
In that case, what exactly does a kernel’s execution time mean in Nsight Systems?
If an application consisting of a single kernel named B is profiled with Nsys and the execution time of kernel B is reported as 1s, does kernel B also take 1 second to complete when the application runs without Nsys profiling?

jyi · April 5, 2024, 6:48pm

The difference between the runtime and the duration seems abnormally high. Is there a way for us to get the application you are using, so that we can debug this? If not, could you try running your application with the latest version of Nsight Systems using the command line nsys profile --trace cuda --sample none --cpuctxsw none <app> [args] and see if it helps?

“If an application consisting of a single kernel named B is profiled with Nsys and the execution time of kernel B is reported as 1s, does kernel B also take 1 second to complete when the application runs without Nsys profiling?”

They won’t be identical, but the added GPU latency per kernel will mostly be less than 1us.
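
As a rough back-of-the-envelope sketch of what that per-kernel figure implies: if each kernel gains at most about 1us, the total added GPU time is bounded by the kernel count times 1us. The kernel count below is a hypothetical number, not taken from this report, and the function name is made up for illustration; the 1us default only restates the "mostly less than 1us" estimate above.

```python
def added_gpu_time_bound_s(num_kernels: int, per_kernel_us: float = 1.0) -> float:
    """Approximate upper bound, in seconds, on GPU time added by profiling.

    per_kernel_us restates the "mostly less than 1us" estimate; it is not a
    guaranteed limit.
    """
    return num_kernels * per_kernel_us / 1e6


# Hypothetical run with one million kernel launches.
print(added_gpu_time_bound_s(1_000_000), "s")
```

Under these assumptions, even a million kernels would add on the order of one second of GPU time, so per-kernel latency alone is unlikely to account for a gap of many tens of seconds.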

psh2018314072 · April 6, 2024, 1:44pm

Since it is an LLM training application, it would be difficult to share because of environment setup issues. Unfortunately, the result is the same even with the latest nsys version and the options you suggested. For now, I have no choice but to proceed with the analysis in the Nsys time domain. Thank you very much for your help. I will let you know if the problem is resolved or if I find any errors. Thank you!
