performance measurement of full kernel from inside the kernel

background:
I have a kernel that I measure with Windows QPC (264-nanosecond tick rate) at 4 ms. But I am in a friendly dispute with a colleague running my kernel who claims it takes 15 ms+ (we are both doing this after warm-up with a Tesla K40). I suspect his issue is with a custom RHEL build, custom CUDA drivers, and his "real-time" thread groups, but I am not a Linux expert. I know Windows clocks are less than perfect, but this is too big a discrepancy. (Besides, the timing of all the other kernels I wrote agrees with his timing; it is only the first kernel in the chain whose time disagrees.) Smells to me of something outside the kernel.

Anyway, is there a way with CUDA device events (elapsed time) to add something to the CUDA kernel to measure the ENTIRE kernel time, from when the first block starts to the end of the last block? I think this would get us started in figuring out where the problem is. From my reading, it looks like CUDA events are issued from the host, and I am looking for something internal to the GPU.
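For illustration, something like the following rough sketch is the kind of thing I have in mind (untested; the kernel body and launch configuration are placeholders): each block's thread 0 records a timestamp at block start and block end, and atomics keep the earliest start and latest end across all blocks.

// Rough sketch (untested): %globaltimer is a PTX special register holding a
// nanosecond timer (sm_30+; its resolution varies by GPU). 64-bit atomicMin/
// atomicMax need sm_35+, which the K40 satisfies (compile with -arch=sm_35).
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned long long d_first_start = 0xFFFFFFFFFFFFFFFFull;
__device__ unsigned long long d_last_end    = 0ull;

__device__ unsigned long long globaltimer_ns()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void myKernel()   // stand-in for the real kernel
{
    if (threadIdx.x == 0)
        atomicMin(&d_first_start, globaltimer_ns());   // earliest block start

    // ... the real kernel work would go here ...

    __syncthreads();
    if (threadIdx.x == 0)
        atomicMax(&d_last_end, globaltimer_ns());      // latest block end
}

int main()
{
    myKernel<<<256, 256>>>();
    cudaDeviceSynchronize();

    unsigned long long first = 0, last = 0;
    cudaMemcpyFromSymbol(&first, d_first_start, sizeof(first));
    cudaMemcpyFromSymbol(&last,  d_last_end,    sizeof(last));
    printf("in-kernel span: %.3f ms\n", (last - first) / 1e6);
    return 0;
}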

What about the NVIDIA Visual Profiler? Using nvprof you can collect profile information (number of calls, time, …) for each kernel:

$ nvprof ./Your_program
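If the summary is not enough, nvprof's GPU trace mode prints the start time and duration of every individual kernel launch, which would show directly whether the first launch in the chain is the odd one out:

$ nvprof --print-gpu-trace ./Your_program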

“(Besides, the timing of all the other kernels I wrote agrees with his timing; it is only the first kernel in the chain whose time disagrees.) Smells to me of something outside the kernel.”

since we are now smelling things, I would think that it perhaps rather 'smells' like JIT compilation overhead of the kernel

I would think that, without compilation overhead, cudaEventRecord() around a kernel launch, coupled with cudaEventElapsedTime(), should be able to measure the execution time of the kernel with a margin of error of microseconds, which should be acceptable when you are arguing about milliseconds

hence, you could consider pre-compiling the kernel for the target architecture, or, for test purposes, issue the same kernel twice and measure around the second launch, which should have the compilation overhead removed
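for instance, a rough sketch along these lines (the dummy kernel, sizes and names are placeholders for your real code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x, int n)   // stand-in for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // first launch: absorbs JIT compilation / lazy-initialization overhead
    myKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    // second launch: timed with events recorded in the same stream
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // resolution of roughly 0.5 microseconds
    printf("second launch: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}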

perhaps