nvvp: counting cycles where no warp is runnable is not currently possible, but would be really helpful

While optimizing a kernel and trying to check whether increasing occupancy would help, something struck me:
It would be really helpful if the profiler had a counter for cycles where no warp is ready to run and which are therefore wasted.

I don’t think there currently is a way to get at that information directly (though I would be happy to be proven wrong). It would be really helpful, however, as it would be the most direct indicator of whether improving occupancy would be of any benefit. And I think it should be quite easy to add such a counter to the hardware (even though that means waiting for a few generations of GPUs until we get support for it).

Nvidia currently puts a lot of emphasis on occupancy (although matters are improving thanks to Vasily Volkov). However, occupancy is not the quantity of interest; wasted GPU cycles are. Occupancy is at best an indirect indicator of that. Direct measurement would be a lot more helpful, and should be relatively simple to implement in hardware.

I don’t know if Nvidia engineers could extract the number from the already existing performance counters like instructions (and clock), which would of course be even more welcome as it means not having to wait for new hardware.
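A rough sketch of how such an estimate might look, assuming the profiler reports an issued-instruction count and an active-cycle count per SM. The function name, the parameter names, and the issue width are all assumptions made for illustration, not a documented profiler formula. Note that this only bounds the fraction of unused issue slots; it cannot tell whether a slot went unused because no warp was ready or for some other reason, which is exactly why a dedicated counter would be better.

```python
def estimate_idle_issue_fraction(inst_issued, active_cycles, issue_slots_per_cycle=1.0):
    """Estimate the fraction of issue slots that went unused on one SM.

    inst_issued           -- instructions issued (hypothetical counter value)
    active_cycles         -- cycles with at least one resident warp (hypothetical)
    issue_slots_per_cycle -- assumed peak issue rate; depends on the architecture
    """
    if active_cycles <= 0 or issue_slots_per_cycle <= 0:
        raise ValueError("active_cycles and issue_slots_per_cycle must be positive")

    # Peak number of instructions the SM could have issued in this time.
    peak_issue = active_cycles * issue_slots_per_cycle

    # Fraction of slots actually used, clamped in case the counters
    # are sampled inconsistently and the ratio exceeds 1.
    used = min(inst_issued / peak_issue, 1.0)

    return 1.0 - used


if __name__ == "__main__":
    # Made-up counter values, purely to show the call.
    idle = estimate_idle_issue_fraction(inst_issued=600, active_cycles=1000)
    print(f"estimated idle issue-slot fraction: {idle:.2f}")
```

If the estimate is close to zero, the issue slots are already nearly full and higher occupancy is unlikely to help; a large value only says there is headroom, not that occupancy is the cause.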

Any interesting changes about this in CUDA 4.2? The command-line profiler documentation hasn’t been updated at all, and it appears you can’t see the profiling counter descriptions in nvvp unless you have a device of matching compute capability in your computer. Is there any reason why the (nice) profiler documentation cannot be packaged as a PDF document in the doc/ dir like the other manuals?

Progress!

According to Greg Smith’s comment on StackOverflow, the info is available under Nsight VSE, and he has filed an RFE to include it in CUPTI and the CUPTI-based profilers as well.