I find that the elapsed cycles measured with the clock() function differ from those reported by Nsight Compute.
Nsight Compute reports 12052 cycles.
The clock() function measures 10977 cycles (+ 1896 cycles for an empty kernel) = 12873 cycles, error = 6.8%.
Is this a bug in Nsight Compute, or is my measurement method incorrect?
This is most likely neither a bug nor a flaw in your measurement method. They are simply two different ways of measuring the kernel runtime. If you compare either number against the runtime reported by Nsight Systems, you will likely find that it is slightly off, too. The reason is that both tools, and your own implementation, each measure the kernel runtime somewhat differently, because each has to account for its own specific overhead.
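To make the difference concrete, here is a minimal sketch of in-kernel timing. It uses clock64(), the 64-bit variant of clock(). The kernel name, the helper measure_kernel_cycles, the FMA loop, and the 1-block/32-thread launch are all made up for illustration and are not your code. The point is only that the two counter reads bracket the region between them, as seen by one thread on one SM. Anything that happens before the first read or after the second (kernel launch, block scheduling, kernel teardown) is not part of the result, whereas a profiler measuring from outside the kernel may include some of it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: a short dependent FMA chain per thread.
__global__ void timed_kernel(float *out, long long *cycles, int iters)
{
    // Per-SM cycle counter, read after the thread has already started running.
    long long start = clock64();

    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i)
        x = x * 1.0001f + 0.5f;

    // Store the result so the loop is not optimized away. Note that the
    // compiler may still move instructions across the counter reads.
    out[threadIdx.x] = x;

    long long stop = clock64();

    // Report the cycles seen by a single thread only.
    if (threadIdx.x == 0)
        *cycles = stop - start;
}

// Launches one warp and returns the cycle count observed inside the kernel.
// Error checking is omitted to keep the sketch short.
long long measure_kernel_cycles(int iters)
{
    float *d_out = nullptr;
    long long *d_cycles = nullptr;
    long long h_cycles = 0;

    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    timed_kernel<<<1, 32>>>(d_out, d_cycles, iters);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return h_cycles;
}

int main()
{
    long long cycles = measure_kernel_cycles(1000);
    printf("cycles measured inside the kernel: %lld\n", cycles);
    return 0;
}
```

If you profile this with Nsight Compute, do not expect the printed value to match the tool's elapsed-cycle figure exactly. A gap of a few percent, like the one you observed, is plausible for two measurements taken at different points.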