%globaltimer update frequency


I am trying to measure the latency of the inner loop of a persistent kernel.
I came across this Stack Overflow answer explaining how to perform latency measurements in a kernel: "How to convert CUDA clock cycles to milliseconds?"

In this post, it is stated:

The default resolution is 32ns with update every µs. The NVIDIA performance tools force the update to every 32 ns (or 31.25 MHz)

The latency I’m trying to measure is very short (~1 µs), so an update every microsecond is too coarse.
Is there a way to enable the 32 ns update programmatically, without running a full profiling tool? I have looked for a solution in NVML and CUPTI, but without success so far.
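
For context, here is a minimal sketch of how the current update interval of %globaltimer can be observed from inside a kernel. It reads the register through inline PTX and records the size of each jump in the timer value. The names read_globaltimer and probe_timer_step are only illustrative, and the step sizes it prints will depend on the GPU, the driver, and whether a profiling tool is attached.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the %globaltimer special register (value in nanoseconds) via inline PTX.
// Hypothetical helper name, used for illustration only.
__device__ __forceinline__ unsigned long long read_globaltimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Spin until the timer value changes, n times in a row, and record
// how far the value jumped on each change.
__global__ void probe_timer_step(unsigned long long* steps, int n) {
    unsigned long long prev = read_globaltimer();
    for (int i = 0; i < n; ++i) {
        unsigned long long now;
        do {
            now = read_globaltimer();
        } while (now == prev);
        steps[i] = now - prev;
        prev = now;
    }
}

int main() {
    const int n = 16;
    unsigned long long* d_steps = nullptr;
    if (cudaMalloc(&d_steps, n * sizeof(unsigned long long)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // A single thread is enough: only the timer's step size is of interest.
    probe_timer_step<<<1, 1>>>(d_steps, n);

    unsigned long long h_steps[n];
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_steps, d_steps, sizeof(h_steps),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_steps);
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        printf("step %d: %llu ns\n", i, h_steps[i]);
    }

    cudaFree(d_steps);
    return 0;
}
```

Running this with and without a profiler attached should show whether the observed step changes between the two cases.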

Thanks for any information,