I get different results from CUPTI and nvprof. I ran callback_timestamp.cu and compared its output with what nvprof reports. The kernel running time differs greatly: the time reported by nvprof is much shorter. How do I convert the result obtained by CUPTI to microseconds? Is it divided by 1000?
GPU time: 76288
nvprof avg: 1.9840us (vceadd)