Hello, I am currently developing a matrix multiplication kernel, and I am in the process of comparing the performance against cuBLAS.
Measuring with Nsight Compute, the reported time for the cuBLAS kernel is 137.7 µs, whereas my custom kernel is reported as 265.09 µs. I have been working from this information, optimizing my kernel using the metrics reported by Nsight Compute.
However, when I profile a program that calls both kernels with Nsight Systems, the (end - start) durations shown on the CUDA HW row are quite different, with cuBLAS taking 152.609 µs and my kernel taking 119.937 µs.
I understand that the rigorous profiling steps of Nsight Compute may add to the time it reports, but because the kernel duration comparison is reversed between Nsight Compute and Nsight Systems, I am hesitant to conclude that my kernel is indeed faster.
There are small differences in the timing methods used by Nsight Compute and Nsight Systems; these have been discussed in other threads. However, the first step should be to eliminate the environment controls that Nsight Compute applies to improve the consistency of multi-pass collection.
Nsight Compute by default locks the GPU clock to the base clock. This can result in a different ratio of compute to bandwidth (GPCCLK to MCLK), favoring compute-bound kernels.
Nsight Compute serializes the workload and tries to set the GPU into a consistent state between replays.
The recommended approach would be to run Nsight Compute in a manner closer to Nsight Systems (an example command combining the following options is sketched after this list):
Only collect the GPU duration, removing replay passes: --metrics gpu__time_duration.sum
Do not lock clocks: --clock-control none
Do not flush and invalidate caches: --cache-control none
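Combining the three options above, a single-pass collection would look roughly like the following, where ncu is the Nsight Compute CLI and ./my_app is a placeholder for your executable:

ncu --metrics gpu__time_duration.sum --clock-control none --cache-control none ./my_app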
Any additional analysis requires a minimal reproducible example, information on the system, and information on the toolchain and compilation.
I would recommend that you add a warm-up pass and time the kernel tens to hundreds of times to get a more accurate duration value.
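As a minimal sketch of that approach, assuming a square problem size and a placeholder naive GEMM kernel (my_gemm_kernel, the launch configuration, and N = 1024 are illustrative; substitute your own kernel), the warm-up plus repeated timing with CUDA events could look like this:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder naive GEMM; substitute your own kernel and launch configuration.
__global__ void my_gemm_kernel(const float* A, const float* B, float* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

int main()
{
    const int N = 1024;                        // assumed problem size
    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));     // contents left uninitialized;
    cudaMalloc(&B, N * N * sizeof(float));     // acceptable for timing only
    cudaMalloc(&C, N * N * sizeof(float));

    dim3 block(32, 32), grid(N / 32, N / 32);

    // Warm-up launches so clocks, caches, and lazy initialization do not skew the result.
    for (int i = 0; i < 10; ++i)
        my_gemm_kernel<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 100;                     // time tens to hundreds of launches
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        my_gemm_kernel<<<grid, block>>>(A, B, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // total milliseconds across all launches
    printf("average kernel time: %.3f us\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

The averaged event time should line up more closely with the Nsight Systems CUDA HW durations than with a single profiled launch, since it amortizes launch-to-launch variation.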