While optimizing a kernel and trying to check whether increasing occupancy would help, something struck me:
It would be really helpful if the profiler had a counter for cycles in which no warp is ready to run and which are therefore wasted.
I don’t think there is currently a way to get at that information directly (though I would be happy to be proven wrong). However, it would be really helpful, as it would be a straightforward indicator of whether improving occupancy would be of any benefit. And I think it should be quite easy to add such a counter to the hardware (even though that means waiting for a few generations of GPUs until we get support for it).
Nvidia currently puts a lot of emphasis on occupancy (although matters are improving thanks to Vasily Volkov). However, occupancy is not the quantity of interest; wasted GPU cycles are. Occupancy is at best an indirect indicator of that. Direct measurement would be a lot more helpful, and should be relatively simple to implement in hardware.
I don’t know whether Nvidia engineers could derive the number from already existing performance counters such as instructions and clock cycles, which would of course be even more welcome, as it would mean not having to wait for new hardware.
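
To make the idea concrete, here is a rough sketch of the kind of estimate I have in mind. It is only an approximation under strong assumptions: that the instruction and cycle counts are taken per multiprocessor, and that every issued warp instruction occupies exactly one issue slot (no replays, no multi-cycle or dual issue). The function name and the `issue_slots_per_cycle` parameter are hypothetical, not part of any profiler API.

```python
def estimate_idle_fraction(instructions_issued, elapsed_cycles,
                           issue_slots_per_cycle=1.0):
    """Estimate the fraction of issue slots in which nothing was issued.

    ASSUMPTION: each issued warp instruction uses exactly one issue slot,
    and both counters refer to the same multiprocessor and time window.
    Real hardware deviates from this, so treat the result as a rough guide.
    """
    if elapsed_cycles <= 0 or issue_slots_per_cycle <= 0:
        raise ValueError("cycle count and issue rate must be positive")

    total_slots = elapsed_cycles * issue_slots_per_cycle
    # Clamp so that counter noise cannot produce a negative idle fraction.
    used_slots = min(instructions_issued, total_slots)
    return 1.0 - used_slots / total_slots


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    print(estimate_idle_fraction(750, 1000))
```

An estimate like this cannot tell whether a slot was idle because no warp was ready or for some other reason, which is exactly why a dedicated counter would be preferable.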