I’m benchmarking some kernels on the TK1 and find that cudaEvent measurements are imprecise for very short kernels (0.5 ms).
I get expected measurements only when I significantly scale up the test.
Is this a result of tight DVFS management or is it a precision issue with the cudaEvents?
Can someone at NVIDIA explain this or describe a workaround?
I’m running these benchmarks over ssh, and the board is running headless on the network.
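For reference, here is a minimal sketch of the kind of cudaEvent timing in question. It is illustrative only, not the actual benchmark: the `scale` kernel, the array size `n`, and the repeat count `reps` are all hypothetical placeholders. It shows one possible way to scale a test up, by recording a single pair of events around a batch of launches and dividing by the launch count, after one untimed warm-up launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; not one of the kernels being benchmarked.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;   // assumed problem size
    const int reps = 100;    // assumed number of timed launches
    float *d_x = 0;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Untimed warm-up launch, so one-time initialization is excluded.
    scale<<<blocks, threads>>>(d_x, n, 1.0f);
    cudaDeviceSynchronize();

    // Time a batch of launches between a single pair of events.
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        scale<<<blocks, threads>>>(d_x, n, 1.0f);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("total: %f ms, average per launch: %f ms\n", ms, ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

With `reps` set to 1 this reduces to the single short-kernel measurement, so the two cases can be compared directly on the same board.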
This post by Puget Systems also has me concerned (final paragraph):