I get different times from ncu and the PyTorch profiler

I profiled the time of a conv3d, but I got different times from ncu and the PyTorch profiler.

  1. I profile with PyTorch like this:
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU], record_shapes=True, profile_memory=True, with_modules=True) as prof:
    output = model(inputs)

The profiler uses CUPTI to collect the kernel times.
I got 13000 ms for the kernel (a fuller, self-contained version of this measurement is sketched below).
2. I run the same Python script under ncu, and I got 16000 ms for the same kernel.
PyTorch/CUPTI reports less time? That is a 23% discrepancy. Which one is more accurate?
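
For reference, a stripped-down, self-contained version of my measurement looks roughly like this; the Conv3d and input shapes are placeholders for my actual model:

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Conv3d(16, 32, kernel_size=3).cuda()    # placeholder for my real model
inputs = torch.randn(4, 16, 32, 64, 64, device="cuda")   # placeholder input

with profile(activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
             record_shapes=True, profile_memory=True, with_modules=True) as prof:
    output = model(inputs)

# the per-kernel GPU times in this table come from CUPTI timestamps
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))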

CUPTI tracing measures execution time differently from Nsight Compute's kernel profiling. Nsight Compute serializes kernel launches and, while collecting metrics, may replay kernels, flush caches, and lock the GPU clocks to their base frequency, so the durations it reports are not directly comparable to the timestamps CUPTI records during an unmodified run. Both values are correct in the sense that each is consistent with the rest of the data reported by the respective tool or library. For measuring pure kernel runtime, it is recommended to rely on CUPTI or Nsight Systems; for measuring kernel-level performance metrics, it is recommended to rely on Nsight Compute.
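
If you want a quick, tool-independent sanity check, CUDA events are one option. This is only a sketch: it times the whole forward pass on the GPU rather than the individual conv kernel, so it just tells you which of the two numbers is in the right ballpark; the model and input below are placeholders:

import torch

model = torch.nn.Conv3d(16, 32, kernel_size=3).cuda()    # placeholder model
inputs = torch.randn(4, 16, 32, 64, 64, device="cuda")   # placeholder input

# warm up so lazy initialization and cuDNN autotuning are excluded
for _ in range(3):
    model(inputs)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
output = model(inputs)
end.record()
torch.cuda.synchronize()          # wait for the recorded events to complete

print(f"forward pass GPU time: {start.elapsed_time(end):.3f} ms")

If the event timing lands close to the CUPTI number, the gap to Nsight Compute is most likely profiling overhead rather than a measurement error.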