I was benchmarking my kernels on a V100 provided by Google Cloud, and found that they run about 18% faster while watch -n 0.1 nvidia-smi is running.
Persistence mode is enabled. The CUDA driver was installed from the official .deb package without any modifications. Driver Version: 455.23.05, CUDA Version: 11.1.
One benchmark involves several kernel calls, and the benchmark is run multiple times, so I assume the kernels are warmed up by the second run.
The time is measured separately with CUDA events and with a CPU timer, and the two measurements are consistent (roughly the setup sketched below).
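For reference, here is a minimal sketch of how the timing is done; the kernel, launch configuration, and problem size are placeholders, not the actual benchmark:

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Stand-in for the real memory-bound benchmark kernels.
__global__ void my_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Warm-up launch so the measured run is not the first one.
    my_kernel<<<(n + 255) / 256, 256>>>(out, in, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Measure the same launch with CUDA events and a CPU timer.
    auto cpu_start = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(out, in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    auto cpu_end = std::chrono::steady_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    double cpu_ms = std::chrono::duration<double, std::milli>(cpu_end - cpu_start).count();
    printf("CUDA events: %.3f ms, CPU timer: %.3f ms\n", gpu_ms, cpu_ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```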
The running time with watch -n 0.1 nvidia-smi is closer to the time measured by ncu.
The kernels are all memory-bandwidth bound (see the rough check below).
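By that I mean the effective bandwidth (total bytes moved divided by kernel time) is close to the V100's HBM2 peak of roughly 900 GB/s; a back-of-the-envelope check with hypothetical numbers looks like this:

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers: one read plus one write of 2^24 floats,
    // and a measured kernel time of 0.2 ms.
    double bytes_moved = 2.0 * (1 << 24) * sizeof(float);
    double kernel_ms   = 0.2;
    double gbps = bytes_moved / (kernel_ms * 1e-3) / 1e9;
    // Against a ~900 GB/s theoretical peak, ~670 GB/s indicates a
    // bandwidth-bound kernel.
    printf("effective bandwidth: %.1f GB/s\n", gbps);
    return 0;
}
```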
I know it sounds extremely weird, but it indeed happened. I am wondering if anyone has run into the same problem or has any comments on this.