If you are using Nsight Compute, then you are serializing kernels, which changes execution on the GPU. Nsight Compute targets single-kernel profiling, so its job is to make the information for the kernel execution as accurate as possible by isolating the kernel. gpu__time_duration.sum introduces no measurable overhead to the kernel execution. However, gpu__time_duration is simply end timestamp minus start timestamp: if the kernel runs long enough, it is likely to be context switched, and the duration will then include time the GPU spent executing another context.
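For reference, a minimal way to collect this metric from the command line; `./my_app` is a placeholder for your application:

```
# Collect only the kernel duration metric; kernel serialization still applies.
ncu --metrics gpu__time_duration.sum ./my_app
```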
Nsight Systems/CUPTI provide the most accurate start and end timestamps that we can collect on the GPU. It is recommended that you use Nsight Systems before using Nsight Compute to optimize individual kernels. It is also recommended that you iterate back and forth between the two tools, as you can naively optimize a kernel at the cost of concurrency between kernels, resulting in an overall performance regression. A sketch of that workflow follows.
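A rough sketch of the iteration, again with `./my_app` as a placeholder and the kernel-name filter purely illustrative:

```
# 1. Capture a whole-application timeline first to see kernel overlap,
#    gaps, and which kernels dominate.
nsys profile --stats=true -o timeline ./my_app

# 2. Then drill into a specific hot kernel with Nsight Compute
#    (-k filters by kernel name).
ncu -k my_hot_kernel --metrics gpu__time_duration.sum ./my_app

# 3. After optimizing, re-run step 1 to confirm that concurrency
#    between kernels did not regress.
```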