Kernel time in Nsight Systems is larger than in Nsight Compute

Environment
RTX 4090
clocks locked using nvidia-smi -lgc and nvidia-smi -lmc
CUDA 11.8
nsys version = 2023.3.1.92-233133147223v0
ncu version = 2023.3.0.0

Problem
When I profile torch.matmul with a fixed input shape, the kernel time reported by Nsight Systems is larger than the one reported by Nsight Compute.
With nsys, the kernel duration is about 40 us:
command: nsys profile --stats true python MYPROGRAM


With ncu, the duration is about 30 us:
command: /usr/local/cuda-11.8/bin/ncu --set full --section MemoryWorkloadAnalysis_Chart -f -o temp python MYPROGRAM


Why is there such a large gap? Is there a reason for it?
PS: I have cleared the L2 cache manually with
cache = torch.ones(…)
cache.zero_()

If you are using Nsight Compute then you are serializing kernels, which changes the execution on the GPU. Nsight Compute targets single-kernel profiling, so its job is to make the information for the kernel execution as accurate as possible by isolating the kernel. gpu__time_duration.sum introduces no measurable overhead to the kernel execution. However, gpu__time_duration is end timestamp minus start timestamp. If the kernel is long enough, a context switch is likely, and the duration will then include time the GPU spent executing another context.
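
For example, that duration metric can be collected on its own rather than with the full section set (a hedged sketch; option spellings may differ between ncu versions):
command: ncu --metrics gpu__time_duration.sum python MYPROGRAM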

Nsight Systems/CUPTI report the most accurate start and end timestamps that we can get on the GPU. It is recommended that you use Nsight Systems before using Nsight Compute to optimize individual kernels. It is also recommended that you iterate back and forth between the two tools, as you can naively optimize a kernel at the cost of concurrency between kernels, resulting in a performance regression.

Due to these differences in collection methodology there can be variations, particularly for smaller kernels. It's recommended to use Nsight Systems for kernel runtime measurement and Nsight Compute for measuring the performance internals of the kernel.

@hwilper Thanks for your reply. I have a torch.matmul op (specifically, [1, 2048] @ [2048, 6144]). The ncu profiling result shows a bandwidth of 838 GB/s for this op (about 2048 * 6144 * sizeof(float16) / 30 us), whereas the nsys result corresponds to 629 GB/s. So I am confused about whether there is any optimization I can still do for this matmul op.
PS: on the 4090 the maximum bandwidth is 1005 GB/s; if the nsys result is the more accurate one, I think there is room to optimize this op.
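
For reference, those bandwidth figures follow from a simple effective-bandwidth calculation (a minimal sketch; it only counts the 2048 x 6144 fp16 weight matrix, which dominates the traffic for this shape):

K, N = 2048, 6144
bytes_moved = K * N * 2                  # sizeof(float16) == 2 bytes
for duration_us in (30, 40):             # ncu vs. nsys reported kernel time
    gb_per_s = bytes_moved / (duration_us * 1e-6) / 1e9
    print(f"{duration_us} us -> {gb_per_s:.1f} GB/s")
# prints 838.9 GB/s and 629.1 GB/s, matching the figures above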

There simply isn't enough information to know whether that kernel is your problem kernel or not.

What you need to do is start with your Nsight Systems results and dive down to see where there are gaps in the GPU usage. From there you will be able to theorize about whether you are spending too much time in memory transfers, whether the GPU is starving, or what your longest kernels are.

From there you can consider going to Nsight Compute to see the best way to optimize specific kernels. But I really wouldn't do that before you are sure that the GPU isn't starving for work.

If you have not gone through it, I highly recommend taking a look at https://developer.nvidia.com/blog/optimizing-cuda-memory-transfers-with-nsight-systems/


@hwilper I have read the blog you provided, but my issue is not about which CUDA kernel to optimize, nor about figuring out whether memory transfers or kernels consume more time in the overall system. What I am trying to understand is why the kernel time displayed by nsys is MUCH LARGER than the time shown by ncu. For concrete analysis, I can provide a minimal reproducible demo. Below is the code I used for testing, which launches only three kinds of kernels: one that generates random numbers, one that sets values to 0, and one that performs the matrix multiplication.

import torch  # torch version is 1.12.1+cu116
M = 1
K = 2048
N = 6144
perf_type = torch.float32
a = torch.randn((M, K), device='cuda', dtype=perf_type)
b = torch.randn((K, N), device='cuda', dtype=perf_type)
cache_ = torch.ones(int(256e6), dtype=torch.int, device='cuda')  # ~1 GB buffer used to flush the L2 cache
for i in range(0, 100):
    cache_.zero_()    # to clear cache
    c0 = torch.matmul(a, b)
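
As a sanity check independent of both profilers, the matmul can also be timed with CUDA events (a rough sketch on top of the script above, reusing a, b and cache_; event timing has its own small overhead, so the numbers are only approximate):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
times_ms = []
for i in range(0, 100):
    cache_.zero_()                            # flush L2 as in the loop above
    torch.cuda.synchronize()
    start.record()
    c0 = torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    times_ms.append(start.elapsed_time(end))  # milliseconds
print(f"median matmul time: {sorted(times_ms)[len(times_ms) // 2] * 1e3:.1f} us")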

Attached are the nsys and ncu files I obtained, in case you need them. (It seems that the forum does not allow uploading files with the nsys-rep/ncu-rep extension, so I changed the extension to .log.)
command: /usr/local/cuda-11.8/bin/ncu --kernel-name ampere_sgemm_32x32_sliced1x4_nn -o torch_matmul "python" benchmark_gevm.py
torch_matmul_ncu.log (4.2 MB)
command: nsys profile --stats true -t cuda,nvtx -s none python benchmark_gevm.py
torch_matmul_nsys.log (273.1 KB)

I can't run this command; the error is as follows:

==PROF== Disconnected from process 3832955
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Can you confirm that there is no problem with this command?

@Rookie_programmer It seems that in my ncu version (2023.3.0.0), --target-processes defaults to all.
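
If your ncu version defaults differently, explicitly adding the option named in the warning should work, e.g.:
command: /usr/local/cuda-11.8/bin/ncu --target-processes all --kernel-name ampere_sgemm_32x32_sliced1x4_nn -o torch_matmul python benchmark_gevm.py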

Maybe you can add --cache-control none and --clock-control none to the ncu command, as in this topic. I had a similar problem, but in my tests the ncu time was lower than the nsys time.

Anyway, the topic I mentioned above helped me a lot. I hope it works for you, too.

The nsys kernel time is much larger because it reflects the actual cost of things like context switching when you have overlapping kernels.

In terms of your actual performance: in the nsys-rep you have ~200 ms of GPU activity over the 4 seconds that you are running. In particular, you have multi-second CUDA API synchronization CPU calls wrapped around <4 ms of GPU work at about 2.26 seconds. Your GPU is starving through most of the run.

However, there are some errors thrown during the run (see the diagnostic pane); it might be worth updating from 2023.3 to 2024.2.

Thanks for your suggestion. If I set --cache-control none and --clock-control none, I cannot keep the cache and clock conditions the same between the ncu and nsys profiling runs. So instead I manually set the GPU clock to a fixed value and clear the cache with cache = torch.ones(…); cache.zero_().
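
(For reference, the clock locking is the nvidia-smi step from the environment section, e.g. nvidia-smi -lgc <gpu_clock> and nvidia-smi -lmc <mem_clock>; the clock values here are placeholders for whatever the card was pinned to.)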

I don't understand what "multi-second CUDA API synchronization CPU calls" means. I can find no synchronization-related operation at about 2.26 seconds. Could you explain that more clearly?
FYI: I have moved from nsys 2023.3 to 2024.2, and the same problem occurs as mentioned above.

Here is a screenshot of that part of the timeline. You can see that there is a lot of CPU time spent in CUDA calls, but not a lot of CUDA work going on on the GPU. The diagnostic pane shows that there may be CUDA calls missing, but since this is after CUPTI initialization, I would be surprised to see that here:

Contrast that with this zoom-in from later in the run, where there are a few short API calls on the CPU side and long kernels on the GPU: