One possible reason for the discrepancy is that Nsight Compute invalidates all caches during data collection, while nvvp does not invalidate the L2 cache. That might also explain the differences in the L2 → Device numbers.
Can you please attach a report from each tool so that we can investigate further?
By default, nvprof does a concurrent kernel trace whereas nv-nsight-cu does a serial trace.
In concurrent trace, the profiler incurs some overhead which is proportional to the number of blocks launched by the kernel, whereas serial trace is not affected by the number of blocks in the kernel.
Hence, concurrent trace (the nvprof default) reports a slightly longer duration for a short-running kernel.
For your case, which has small kernels and no concurrent kernels, you can use the serial trace number.
nvprof also provides an option, “--concurrent-kernels off”, to switch to serial trace.
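As a rough illustration, a serial-trace run with nvprof might look like the following. Here ./my_app is only a placeholder for your own executable, not something taken from this thread.

```shell
# Assumption: ./my_app stands in for your own CUDA executable.
# Turn off concurrent kernel tracing so nvprof traces kernels serially.
nvprof --concurrent-kernels off ./my_app
```

The kernel durations from this run should be closer to the serial-trace numbers, since the per-block overhead of concurrent trace no longer applies.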