DP work scheduling

Hello,

I have an application that must continuously do double precision work over relatively short blocks/arrays. On many occasions the block is shorter than the warp size; sometimes the block is as short as 15, 10, or even 5 elements/points.

The overhead of amalgamating these short blocks so that they are processed jointly would be too high.
The other option is to increase the number of instances, which leads to my question:

Exactly how is work scheduled for the double precision units? If, hypothetically, 32 DP units are available (idle), and 2 warps each have 5 threads with DP instructions due, can the 2 warps execute their DP instructions jointly, or would one have to wait for the other?

On current NVIDIA GPUs, work is issued per warp. The SM cannot schedule a single instruction across two warps. The two instructions would execute back to back, and each would take the same time as if all 32 threads in its warp were active.
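
To put numbers on that for the example in the question, here is a rough Python sketch. It is a toy model, not a measurement: it assumes only what is stated above, namely that each warp with at least one active thread costs one full warp-wide instruction issue. The function names are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_instructions_issued(active_threads_per_warp):
    """Count instruction issues: one per warp that has any active thread.

    Assumption (from the reply above): an instruction is never shared
    between warps, so partially filled warps cannot be merged.
    """
    for n in active_threads_per_warp:
        if not 0 <= n <= WARP_SIZE:
            raise ValueError(f"active thread count out of range: {n}")
    return sum(1 for n in active_threads_per_warp if n > 0)


def lane_utilization(active_threads_per_warp):
    """Fraction of issued lanes that do useful work."""
    issued = warp_instructions_issued(active_threads_per_warp)
    if issued == 0:
        return 0.0
    return sum(active_threads_per_warp) / (issued * WARP_SIZE)


if __name__ == "__main__":
    short_blocks = [5, 5]  # two warps, 5 active threads each
    print("issues:", warp_instructions_issued(short_blocks))
    print("utilization:", lane_utilization(short_blocks))
```

Under this model the two 5-thread warps cost two issues, the same as two full warps, while only 10 of the 64 issued lanes carry useful work.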

Thank you for your reply, Greg.

If I understand you correctly, the units of the SM are utilized on a per-warp basis?

The Kepler architecture white paper mentions 64 DP units and 192 SP units per SM.
This would then mean that at most 2 warps can use the DP units simultaneously, and at most 6 warps can use the SP units simultaneously, regardless of how many threads per warp are actually active at that point; correct?

Devices of compute capability 2.1 have 48 CUDA cores per SM, implying some form of half-warp scheduling…?
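
The arithmetic behind these two questions can be written out as a small Python sketch. It only divides the unit counts quoted above by the warp size. It assumes that a warp instruction occupies 32 units at once, which is exactly the reading being asked about and is not confirmed here, and it ignores any warp scheduler or dispatch limits. The helper name `warps_covered` is hypothetical.

```python
WARP_SIZE = 32  # threads per warp


def warps_covered(units_per_sm):
    """Return (full_warps, leftover_units) for a given unit count.

    Assumption: each warp instruction occupies WARP_SIZE units at once.
    This is an upper bound from unit counts alone, not a statement about
    how the hardware actually issues instructions.
    """
    if units_per_sm < 0:
        raise ValueError("unit count must be non-negative")
    return divmod(units_per_sm, WARP_SIZE)


if __name__ == "__main__":
    # Unit counts as quoted in the posts above.
    for label, units in [
        ("Kepler DP units", 64),
        ("Kepler SP units", 192),
        ("CC 2.1 CUDA cores", 48),
    ]:
        full, leftover = warps_covered(units)
        print(f"{label}: {units} -> {full} full warp(s), {leftover} unit(s) left over")
```

The Kepler counts divide evenly (2 and 6 warps), whereas 48 leaves 16 units over, i.e. half a warp, which is what prompts the half-warp question.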