DP work scheduling

little_jimmy · August 10, 2014, 5:55pm

Hello,

I have an application that must continuously do double precision work over relatively short blocks/arrays. On many occasions the block is shorter than a warp (32 threads); sometimes it is as short as 15, 10, or even 5 elements/points.

The overhead of amalgamating these short blocks so that they are processed jointly would be too high.
The other option is to increase the number of instances, leading to my question:

Exactly how is work scheduled for the double precision units: if, hypothetically, 32 DP units are available (idle), and 2 warps each have 5 threads with DP instructions due, can the 2 warps jointly execute their DP instructions, or would one have to wait for the other?

Greg · August 11, 2014, 9:41pm

On current NVIDIA GPUs work is issued per warp. The SM cannot schedule an instruction across two warps. The two instructions would execute back to back, and each would take the same time as if all 32 threads in its warp were active.
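
To put numbers on that, here is a rough host-side sketch in C++ of the cost model described above: one issue per warp instruction, regardless of how many threads are active. The function names and the one-thread-per-element mapping are assumptions made for illustration; this is not a description of the actual hardware pipeline.

```cpp
// Assumed cost model: a warp instruction occupies one issue slot
// no matter how many of its 32 lanes are active.
constexpr int kWarpSize = 32;

// Warps needed to cover a block of n elements (n > 0), one thread per element.
constexpr int warps_for(int n) {
    return (n + kWarpSize - 1) / kWarpSize;
}

// Warp-level issues for one DP instruction when every short block
// gets its own warp(s), as in the question.
constexpr int issues_separate(int blocks, int n) {
    return blocks * warps_for(n);
}

// Warp-level issues if the same elements were packed densely into warps
// (the amalgamation the original poster considers too costly).
constexpr int issues_packed(int blocks, int n) {
    return warps_for(blocks * n);
}

// Fraction of lanes doing useful work for a block of n elements (n > 0).
constexpr double lane_utilization(int n) {
    return static_cast<double>(n) / (warps_for(n) * kWarpSize);
}
```

For the case in the question, `issues_separate(2, 5)` is 2 while `issues_packed(2, 5)` is 1, and `lane_utilization(5)` is 5/32, about 16%.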

little_jimmy · August 12, 2014, 12:42pm

Thank you for your reply, Greg.

If I understand you correctly, the units of the SM are utilized on a warp basis?

The Kepler architecture white paper mentions 64 DP units and 192 SP units per SM.
This would then mean that at most 2 warps can utilize the DP units simultaneously, and at most 6 warps can utilize the SP units simultaneously, regardless of how many threads per warp are actually active at that point; correct?

Devices of compute capability 2.1 have 48 CUDA cores per SM, implying some form of half-warp scheduling…?
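
The arithmetic behind these questions can be sketched in C++ as below. It takes the unit counts quoted in this post at face value and assumes that one unit serves one lane per cycle; that assumption and the function names are illustrative only, and the result is at best a throughput ceiling, not a statement about how the scheduler actually dispatches warps.

```cpp
// Assumption for illustration: each execution unit serves one lane per cycle,
// so a full 32-lane warp instruction needs 32 unit-cycles.
constexpr int kWarpSize = 32;

// Upper bound on warp instructions a pool of units could retire per cycle.
constexpr double warps_per_cycle(int units) {
    return static_cast<double>(units) / kWarpSize;
}

// True when the unit count is a whole multiple of the warp size.
constexpr bool whole_warps(int units) {
    return units % kWarpSize == 0;
}
```

With the figures above, `warps_per_cycle(64)` is 2 and `warps_per_cycle(192)` is 6, while `warps_per_cycle(48)` is 1.5 and `whole_warps(48)` is false, which is what prompts the half-warp question.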
