I am trying to measure the latency of the inner loop of a persistent kernel.
I came across this Stack Overflow answer that explains how to perform latency measurements in a kernel: "How to convert CUDA clock cycles to milliseconds?"
In this post, it is stated:
The default resolution is 32ns with update every µs. The NVIDIA performance tools force the update to every 32 ns (or 31.25 MHz)
What I'm trying to measure is a very short latency (~1 µs), so a timer that only updates every microsecond is too coarse.
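A minimal sketch of how the effective update interval could be observed from inside a kernel, assuming the timer in question is the PTX `%globaltimer` special register (nanosecond units). The names `read_globaltimer` and `probe_timer_step` are hypothetical, chosen for illustration only. A single thread polls the register and records the smallest nonzero jump between consecutive reads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX %globaltimer special register (value in nanoseconds).
__device__ __forceinline__ unsigned long long read_globaltimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Hypothetical probe: one thread polls the timer and keeps the smallest
// nonzero difference seen between two consecutive reads.
__global__ void probe_timer_step(unsigned long long* min_step, int iterations) {
    unsigned long long prev = read_globaltimer();
    unsigned long long best = ~0ULL;  // stays at max if the timer never changes
    for (int i = 0; i < iterations; ++i) {
        unsigned long long now = read_globaltimer();
        if (now != prev) {
            unsigned long long step = now - prev;
            if (step < best) best = step;
            prev = now;
        }
    }
    *min_step = best;
}

int main() {
    unsigned long long* d_step = nullptr;
    if (cudaMalloc(&d_step, sizeof(*d_step)) != cudaSuccess) return 1;

    // Enough iterations to span many timer updates (an assumption).
    probe_timer_step<<<1, 1>>>(d_step, 1 << 20);

    unsigned long long h_step = 0;
    cudaError_t err = cudaMemcpy(&h_step, d_step, sizeof(h_step),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_step);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("smallest observed %%globaltimer step: %llu ns\n", h_step);
    return 0;
}
```

If the quoted answer is right, the reported step should be close to the default update interval when run standalone and much smaller when the same binary is run under a profiling tool.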
Is there a way to enable the 32 ns update programmatically, without using a full profiling tool? I have been looking for a solution in NVML and CUPTI, but without success so far.
Thanks for any information,