Inconsistent results with Nsight Systems


I am trying to profile the CUDA sample apps (for now vectorAdd and scalarProd) with CUDA 11.4 and nsys version 2022.3.3.18-4d5367b. I am not using any nsys options other than profile (of course) and --stats=true. However, the results differ widely from run to run (for example: 17000 ns, 7840 ns, 22000 ns …).
What could be the issue?

Thank you.

Do the times you mention refer to the kernel execution time, or to the total execution time of the application?

Is some other workload running on the GPU?

You could also use the latest Nsight Systems version, 2023.2, available here, to take advantage of more features and bug fixes.

Hello again,

No, it was only for the kernel. I will try to update it on Monday.

Thank you for your support.

Hello again,

I upgraded Nsight Systems to 2022.5.2.171-32559007v0 because I am running on a Jetson. I still get the same results as before (different values on each run).

I don’t think anything else is running on the GPU. I also rebooted the platform to make sure everything was reset. I also tested with the ncu command and I see the same fluctuations in the kernel execution time. Lastly, I implemented the performance metrics using CUDA events (How to Implement Performance Metrics in CUDA C/C++ | NVIDIA Technical Blog) and I get the same fluctuations.

Best regards,


I just ran some tests using Nsight Compute and Nsight Systems. I launched each profiler 100 times and extracted the execution time of the vectorAdd kernel. For more information, I have attached the scripts that launch the profilers (for reference), and the results are in the attached picture.

As you can see, there are fluctuations in execution time, which I think is normal behavior for the GPU, but the maximum values obtained with nsys are much higher than those from ncu. Which one is more reliable?

In addition, I compared nsys with cudaEventElapsedTime(): with nsys I obtained 22 us and with cudaEventElapsedTime() I obtained 110 us.
I hope I made it clearer this time.
Thank you for your time.

Best regards.
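
For anyone repeating this kind of multi-run comparison, here is a minimal Python sketch of how the per-run kernel times could be summarised before comparing tools. It is not the attached script. The helper name summarize_ns is made up for illustration, and the sample list is just the three values quoted in the first post, in nanoseconds.

```python
import statistics


def summarize_ns(durations_ns):
    """Return simple spread statistics for kernel durations given in nanoseconds."""
    if not durations_ns:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ns)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        # Ratio of slowest to fastest run: a quick indicator of run-to-run spread.
        "max_over_min": ordered[-1] / ordered[0],
    }


# The three example values from the first post (nanoseconds).
sample = [17000, 7840, 22000]
summary = summarize_ns(sample)
print(summary)
```

Comparing the median and minimum rather than the maximum makes the nsys and ncu series easier to compare, since a few slow outliers then do not dominate the picture.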

One thing to note is that you are not comparing apples to apples here.

Nsight Systems is designed to give you information about the entire system. We perturb the running computer as little as possible.

Nsight Compute is designed to give you deep-dive information at the kernel level. They intentionally alter the behavior of the system in order to get the best information about kernel speed-of-light performance. They may do things like replay a kernel multiple times internally to get averages.

Specifically I think the thing they do that you are hitting here is that they pin the GPU frequency at maximum for the duration of the run.
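
One way to check whether clock control explains the gap is to rerun Nsight Compute with its clock control turned off and see whether the ncu numbers start to fluctuate like the nsys ones. The Python sketch below only builds such a command line. It assumes the ncu CLI accepts --clock-control with the values base and none, and ./vectorAdd is a placeholder path; the helper name ncu_command is made up for illustration.

```python
import shlex


def ncu_command(app_path, clock_control="none"):
    """Build an ncu command line with an explicit clock-control setting.

    Assumption: the installed ncu supports --clock-control {base,none}.
    """
    if clock_control not in ("base", "none"):
        raise ValueError("clock_control must be 'base' or 'none'")
    return ["ncu", "--clock-control", clock_control, app_path]


# Placeholder application path; adjust to where the sample was built.
cmd = ncu_command("./vectorAdd")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) on the target machine. Running it once with none and once with base should show how much of the difference comes from clock control.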