cudaEventRecord and Nsight Systems show different durations for a CUDA API call

I wrote a program that uses nppiRemap, and I want to know the exact duration of the nppiRemap call. I measured it in two ways: 1) with cudaEventElapsedTime, and 2) by inspecting the CUDA API events view in Nsight Systems.

The problem is that the two methods give quite different results. According to cudaEventElapsedTime, the nppiRemap call takes ~31 ms, while Nsight Systems shows ~38 ms.

Until now I thought the two methods I used to measure the duration of CUDA events were equivalent. Which result is more reliable?

@skottapalli to respond

Are you trying to measure the duration of the API call on the CPU side? Comparing just one instance of the function call between the two methods of measurement may not be statistically significant. You’ll need to take averages over multiple function invocations.
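As a rough illustration of averaging over multiple invocations, here is a minimal C++ sketch. The names `TimingStats`, `summarize_timings`, and `measure_once_ms` are hypothetical and not part of CUDA, NPP, or Nsight Systems. The sketch assumes you supply a callable that performs one measurement and returns the duration in milliseconds; it discards a few warm-up measurements and summarizes the rest.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Summary of repeated timing measurements, in milliseconds.
struct TimingStats {
    double mean_ms;
    double stddev_ms;  // sample standard deviation (0 when there is one sample)
    double min_ms;
    double max_ms;
    std::size_t samples;
};

// Calls measure_once_ms() (warmup + runs) times, discards the warm-up
// results, and summarizes the remaining measurements.
// measure_once_ms is a hypothetical user-supplied callable that returns
// one measured duration in ms (for example, the value obtained from
// cudaEventElapsedTime for a single nppiRemap call).
inline TimingStats summarize_timings(const std::function<double()>& measure_once_ms,
                                     std::size_t warmup, std::size_t runs) {
    assert(runs > 0 && "need at least one timed run");

    // Warm-up calls: measured but ignored (first calls often include
    // one-time initialization cost).
    for (std::size_t i = 0; i < warmup; ++i) {
        (void)measure_once_ms();
    }

    std::vector<double> ms(runs);
    for (double& m : ms) {
        m = measure_once_ms();
    }

    double sum = 0.0;
    for (double m : ms) sum += m;
    const double mean = sum / static_cast<double>(runs);

    double squared_dev = 0.0;
    for (double m : ms) squared_dev += (m - mean) * (m - mean);
    const double stddev =
        runs > 1 ? std::sqrt(squared_dev / static_cast<double>(runs - 1)) : 0.0;

    return TimingStats{mean, stddev,
                       *std::min_element(ms.begin(), ms.end()),
                       *std::max_element(ms.begin(), ms.end()),
                       runs};
}
```

In your case the callable might record an event with cudaEventRecord before and after nppiRemap, wait on the second event with cudaEventSynchronize, and return the value from cudaEventElapsedTime. Comparing the mean and spread from that against the durations Nsight Systems shows over the same number of calls is more meaningful than comparing a single invocation.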

Measuring call durations with cudaEventElapsedTime and with Nsight Systems is not the same thing. There is definitely some overhead to using Nsight Systems, and how much depends on what else you are tracing or sampling. cudaEventElapsedTime only reports the time elapsed between the two events you recorded, whereas Nsight Systems traces and samples a lot more by default, which can add overhead. The trends will mostly match between the two methods, though: if you make a change that makes the function faster, you should see the function duration decrease in both measurements.