Short duration kernel measurement issues on the TK1?

I’m benchmarking some kernels on the TK1 and find that cudaEvent measurements are imprecise for very short kernels (around 0.5 ms).

I get expected measurements only when I significantly scale up the test.
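For reference, my harness looks roughly like this (a minimal sketch, not my exact code — `busyKernel` is a placeholder for the real kernel, and the iteration count is arbitrary). The "scaled up" case times many back-to-back launches inside one event pair and averages:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel under test.
__global__ void busyKernel(float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = sqrtf((float)i) * 2.0f;
}

int main()
{
  const int n = 1 << 20;
  const int blocks = (n + 255) / 256;
  float* d_out;
  cudaMalloc(&d_out, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Single launch: events have ~0.5 us resolution, so a ~0.5 ms
  // kernel should be well within what they can resolve.
  cudaEventRecord(start);
  busyKernel<<<blocks, 256>>>(d_out, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms;
  cudaEventElapsedTime(&ms, start, stop);
  printf("single launch: %.4f ms\n", ms);

  // "Scaled up": many launches inside one event pair, averaged.
  const int iters = 1000;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i)
    busyKernel<<<blocks, 256>>>(d_out, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg of %d launches: %.4f ms\n", iters, ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_out);
  return 0;
}
```

The single-launch number is the one that comes out noisy; the averaged number is stable.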

Is this a result of tight DVFS (dynamic voltage and frequency scaling) management, or is it a precision issue with the cudaEvents?

Can someone at NVIDIA explain this or describe a workaround?

I’m running these benchmarks over ssh and the board is running headless on the network.

This post by Puget Systems also has me concerned (final paragraph):

After adding a few options to my benchmark, it’s clear that the TK1’s clock speed is being tightly managed.

This is exactly what it should be doing in order to save power.

I have no way of measuring actual GPU clock speed but a “warm-up” routine really helps push the GPU to a higher performance level.
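The warm-up is nothing fancy — just launching the kernel in a tight loop before the timed run so the clocks ramp up (a sketch; `busyKernel` is a placeholder and the 100-iteration count is just a guess that happened to work for me):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel being benchmarked.
__global__ void busyKernel(float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = sqrtf((float)i);
}

int main()
{
  const int n = 1 << 20;
  float* d_out;
  cudaMalloc(&d_out, n * sizeof(float));

  // Warm-up: keep the GPU continuously busy so DVFS ramps to a
  // higher clock. Note there is no synchronization between the
  // launches -- blocking calls seem to give the governor an
  // opportunity to downclock.
  for (int i = 0; i < 100; ++i)
    busyKernel<<<(n + 255) / 256, 256>>>(d_out, n);
  cudaDeviceSynchronize();

  // ...timed run goes here, immediately after the warm-up...
  printf("warm-up done\n");

  cudaFree(d_out);
  return 0;
}
</imports>
```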

But unlike desktop GPUs, there appears to be very little hysteresis once the SMX is at its highest clock speed; I’m guessing that certain blocking operations, like host-initiated memcpys and event synchronization, are opportunities for the TK1 to immediately downclock the GPU.

None of this is a surprise, but without tools to monitor the clocks it’s tough to benchmark.

The takeaway is that benchmarking for peak performance on the TK1 will probably always require some strategy… unless NVIDIA releases a dev tool to adjust the frequency scaling.

Hopefully there is some forthcoming documentation on this subject.