Kernels run faster while `watch -n 0.1 nvidia-smi` is running

I was benchmarking my kernels on a V100 provided by Google Cloud and found that they run 18% faster while `watch -n 0.1 nvidia-smi` is running in the background.

Persistence mode is enabled. The CUDA driver was installed from the official deb package without any changes. Driver Version: 455.23.05, CUDA Version: 11.1.
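One thing that could be compared between the two setups is the GPU clock and performance state. Below is a minimal sketch, assuming Python 3 and `nvidia-smi` on the PATH. The helper names `parse_clock_line` and `query_clocks` are hypothetical, and whether clock state explains the gap is only a guess on my part. Note that each query is itself an `nvidia-smi` call, so it may perturb the very thing being measured; sampling once before and once after a benchmark run is probably the least intrusive.

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`: SM clock, memory clock, performance state.
QUERY_FIELDS = "clocks.sm,clocks.mem,pstate"


def parse_clock_line(line):
    """Parse one CSV line such as '1380, 877, P0' into a dict."""
    sm, mem, pstate = [field.strip() for field in line.split(",")]
    return {"sm_mhz": int(sm), "mem_mhz": int(mem), "pstate": pstate}


def query_clocks(gpu_index=0):
    """Ask nvidia-smi for the current clocks of one GPU (hypothetical helper)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # With -i set, the output is a single CSV line for that GPU.
    return parse_clock_line(result.stdout.strip().splitlines()[0])
```

Calling `query_clocks()` right before and right after a benchmark run, with and without `watch` in the background, would show whether the SM clock, memory clock, or P-state differs between the two cases.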

Each benchmark involves several kernel calls and is run multiple times, so I assume the kernels are warmed up by the second run.

The time is measured separately with CUDA events and with CPU timers. The two measurements are consistent.

The running time with `watch -n 0.1 nvidia-smi` is closer to the time measured by ncu.

The kernels are all memory-bandwidth bound.
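Since the kernels are bandwidth bound, the two timings can also be expressed as effective bandwidth and compared with the device peak. This is a rough sketch in Python: the function name and the example byte counts are hypothetical, and the 900 GB/s figure is the nominal HBM2 peak of a V100 SXM2, which I am assuming applies to this instance.

```python
# Nominal peak memory bandwidth of a V100 SXM2 in GB/s (assumed for this instance).
V100_PEAK_GBS = 900.0


def effective_bandwidth_gbs(bytes_read, bytes_written, elapsed_ms):
    """Effective bandwidth in GB/s: total bytes moved divided by elapsed time."""
    seconds = elapsed_ms * 1e-3
    return (bytes_read + bytes_written) / seconds / 1e9


def fraction_of_peak(bandwidth_gbs, peak_gbs=V100_PEAK_GBS):
    """Fraction of the nominal peak that a measured bandwidth reaches."""
    return bandwidth_gbs / peak_gbs
```

For example, a kernel that reads 4 GB and writes 4 GB in 10 ms reaches about 800 GB/s, or roughly 89% of the nominal peak. Computing this number for both timings would show which one is closer to what the hardware should deliver.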

I know it sounds extremely weird, but it really did happen. I am wondering whether anyone has run into the same problem or has any comments on this.