Is Nsight Systems accurate in measuring kernel execution time?

Recently, I profiled the training process of an LLM with Nsight Systems.
I used the “stats” command to find the total execution time of all communication kernels that occurred during training.
Since the communication kernels never overlapped with each other, I computed the total simply by adding each kernel’s total execution time.
However, the total execution time of the communication kernels reported by Nsight Systems turned out to be much larger than the total training time measured with the time function in the training Python script.
In conclusion, my questions are as follows.

  1. Does Nsight Systems really report the exact execution time of each kernel?
  2. The cuda_gpu_kern_sum report produced by the stats command appears to assume that all kernels run serially when computing the execution time % for each kernel. Is this the right approach?
    Thank you for reading.
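The naive summation described above (simply adding every kernel’s duration, which is only valid if kernels never overlap) can be sketched in Python; the interval pairs here are made-up stand-ins for the start/end timestamps a report export would provide:

```python
def naive_total(intervals):
    """Sum kernel durations directly, as a serial-sum report would.

    intervals: (start_ns, end_ns) pairs, one per kernel launch.
    Overlap between kernels is NOT accounted for, so the result can
    exceed the wall-clock time when kernels run concurrently.
    """
    return sum(end - start for start, end in intervals)

# Two kernels that overlap for 2 time units:
print(naive_total([(0, 10), (8, 20)]))  # 22, though the GPU was busy only 20
```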

Kernel execution time is based on the start/stop timestamps of each kernel from the underlying CUDA driver. It is as exact as can be determined.

I am wondering whether you had compute kernels overlapping with your communication kernels, since both would be included in cuda_gpu_kern_sum. That report is intended to give the user a rough statistical sense of how busy the GPU is.

What is the problem that you are trying to solve? Are you just trying to determine if there are gaps where the GPU is idle?


What I want is to quantify exactly how large the communication overhead is in LLM training.
As you said, we tried to account for overlap between kernels as much as possible, but we could not find a way to measure kernel overlap numerically, and when we looked at the Nsys UI, we did not see the communication kernels overlapping with each other.
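Kernel overlap can be quantified without the UI by merging the kernel time intervals and comparing the merged wall-clock coverage against the naive serial sum. A minimal sketch, assuming the (start, end) pairs have been pulled from the exported report data (the helper name is mine):

```python
def busy_time(intervals):
    """Wall-clock time covered by at least one kernel.

    Sorts the (start, end) intervals and merges overlapping ones so
    that concurrent kernels are counted once, not twice.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this kernel: close out the current merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

kernels = [(0, 10), (8, 20), (30, 35)]
print(busy_time(kernels))              # 25 (merged coverage)
print(sum(e - s for s, e in kernels))  # 27 (naive serial sum)
```

The difference between the two numbers (2 units here) is exactly the overlapped time, so comparing them gives a direct numeric measure of overlap.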

There is a new multi-report recipe to determine overlap between communication and compute.

@jyi can you help with this?

When I checked using the recipe command, the three kernels in the figure below do not overlap. I therefore assumed I could simply add the execution times of these three kernels, but the result was a total execution time (51 s) that exceeds the actual application’s runtime (44.2 s), which leaves me with a question.

Hello, will you be able to share the report file (.nsys-rep) with us?

Absolutely. I have 4 reports from a 4-GPU environment, and together they exceed 100 MB. How can I share them with you?

You could create a tarball with the report files and attach it here.

The reports are too big: even compressing just two of them together exceeds 100 MB, so I will upload only one. Since the training method is DDP, the results in all the report files are similar. (54.5 MB)

Thank you for sharing. After running the ‘nccl_gpu_overlap_trace’ recipe, I see that the ‘ncclDevKernel_AllGather_RING_LL’ and ‘ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL’ kernels do overlap with other communication kernels:

This is an example of an AllGather kernel overlapping with a ReduceScatter kernel:

Thank you, but the communication-overlap percentage for every kernel is less than 1%, so even taking this into account, isn’t the total execution time of those three communication kernels still longer than the actual runtime?

I don’t think we are referring to the same report file. The one you’ve provided has a total duration of 166 s, which is longer than the sum of the communication kernels even without considering the overlaps.

Could you attach the report file you initially used?

We are looking at the same file; the runtime (44.2 s) I mentioned is not the Duration (166 s) shown in Nsys, but the value obtained through the time() function in the actual training Python script.
I understand that the actual runtime and the Nsys runtime are in different time domains because of the profiling overhead.
Then, what exactly does the execution time of a kernel mean in Nsight Systems?
If an application consisting of a single kernel named B is profiled with Nsys and kernel B’s execution time is 1 s, does kernel B also take 1 second to complete when the application is run without Nsys profiling?
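For reference, a wall-clock measurement like the one described above (the time() call in the training script) would typically look like this; the sleep is a stand-in for the real training loop:

```python
import time

t0 = time.perf_counter()  # monotonic wall-clock start
time.sleep(0.1)           # stand-in for the training loop
runtime_s = time.perf_counter() - t0
print(f"runtime: {runtime_s:.2f}s")
```

This measures end-to-end wall-clock time of the process, so a serial sum of kernel durations can legitimately exceed it only if some kernels actually ran concurrently, which is why the overlap question matters here.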

The difference between the runtime and the duration seems abnormally high. Is there a way for us to get the application you are using so we can debug it? If not, could you try running your application with the latest version of Nsight Systems using the command line nsys profile --trace cuda --sample none --cpuctxsw none <app> [args] and see if it helps?

If an application consisting of a single kernel named B is profiled with Nsys and kernel B’s execution time is 1 s, does kernel B also take 1 second to complete when the application is run without Nsys profiling?

They won’t be identical, but the added GPU latency per kernel will mostly be less than 1 µs.


Since it is an LLM training application, it will be difficult to share due to environment-setup issues. Unfortunately, the result is the same even with the latest nsys version and the options you advised. I will inevitably proceed with the analysis in the Nsys time domain. Thank you very much for your help; I will follow up if the problem is resolved or any errors are found. Thank you!