Warp thread Scheduling

satyam_shivam · June 2, 2010, 1:12pm

Hello Everyone,

When a warp is issued to an SM, are all its threads executed on one core (so that 8 warps can be executed simultaneously on one SM), or are they divided among all the cores of that SM?

I guess they are divided among the cores. If so, can someone please explain the sequence in which the threads of a warp are issued to the cores?

(Assuming 8 cores/SM) Is it like this:

warp1
threads 0, 1, 2 … 7 on core 1 (one after the other), threads 8, 9, 10 … 15 on core 2, threads 16, 17 … 23 on core 3, threads 24 … 31 on core 4

warp2
threads 0, 1, 2 … 7 on core 5, threads 8, 9, 10 … 15 on core 6, threads 16, 17 … 23 on core 7, threads 24 … 31 on core 8

This approach would ensure that two warps are executed simultaneously on one SM.

OR

All the threads of a single warp are distributed across all the cores of one SM. Each warp is divided into 4 parts.
threads 0, 8, 16, 24 on core 1 (one after the other), threads 1, 9, 17, 25 on core 2, and so on …

I guess it should be approach 2, but I am not sure. Can someone please help me with this?

Thanks and Regards

rodrigob · June 2, 2010, 2:47pm

Could you tell us why you need to know that?

As far as I understand, which thread runs where and when is undefined (which means “it depends, and we give no guarantees that it will run the same way on two different computers”). Take a look at the CUDA Programming Guide and search for “undefined”.

Most probably what you are trying to do can be solved in a different way.

satyam_shivam · June 3, 2010, 3:41am

I am not actually doing anything. I am just trying to understand the execution model.

I tried searching for “undefined” but could not find anything. Can you please tell me which version and which page you are referring to?

rodrigob · June 8, 2010, 4:31pm

Searching Google for “Cuda programming guide undefined” returns version 2.0 of the guide, where “undefined” is mentioned 10 times.

MisterAnderson42 · June 9, 2010, 11:49am

This is well defined in the programming guide, in section G.3.1 for compute capability 1.x and in a later section for 2.x:

For devices of compute capability 1.x, a multiprocessor consists of:

- 8 CUDA cores for integer and single-precision floating-point arithmetic operations,
- 1 double-precision floating-point unit for double-precision floating-point arithmetic operations,
- 2 special function units for single-precision floating-point transcendental functions (these units can also handle single-precision floating-point multiplications),
- 1 warp scheduler.

To execute an instruction for all threads of a warp, the warp scheduler must therefore issue the instruction over:

- 4 clock cycles for an integer or single-precision floating-point arithmetic instruction,
- 32 clock cycles for a double-precision floating-point arithmetic instruction,
- 16 clock cycles for a single-precision floating-point transcendental instruction.

In other words, the 8 cores of an MP (compute 1.x) execute the same instruction for the same warp 4 times, once for each quarter warp.
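
The cycle counts quoted above follow from dividing the 32 threads of a warp by the number of units available for each instruction type. Here is a minimal sketch of that arithmetic in plain Python (not CUDA). The function and dictionary names are made up for illustration, and the contiguous quarter-warp split in `quarter_warp` is an assumption: the guide excerpt does not say which lane runs on which core.

```python
import math

WARP_SIZE = 32  # threads per warp

# Execution units per multiprocessor on compute capability 1.x,
# taken from the programming guide excerpt quoted above.
UNITS_PER_SM = {
    "int_or_single_precision": 8,  # CUDA cores
    "double_precision": 1,         # double-precision unit
    "transcendental": 2,           # special function units
}


def issue_cycles(kind):
    """Clock cycles needed to issue one instruction for a whole warp."""
    return math.ceil(WARP_SIZE / UNITS_PER_SM[kind])


def quarter_warp(lane):
    """Index (0-3) of the pass over the 8 cores that handles this lane.

    ASSUMPTION: quarter-warps are contiguous lane ranges (0-7, 8-15, ...).
    The guide excerpt does not spell out the lane-to-core mapping.
    """
    return lane // UNITS_PER_SM["int_or_single_precision"]


if __name__ == "__main__":
    for kind in UNITS_PER_SM:
        print(kind, issue_cycles(kind))
```

With 8 cores this gives 4 cycles, with 1 double-precision unit 32 cycles, and with 2 special function units 16 cycles, matching the list above.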

satyam_shivam · June 9, 2010, 5:43pm

Thank you very much for your efforts!

I was able to find the term “undefined”, but none of the occurrences seemed to answer my question!

Probably I need to read more carefully!

Thanks anyway!

satyam_shivam · June 9, 2010, 5:54pm

Thanks, Anderson! That helped …

I had read earlier the same thing, but somehow did not understand the concept. But when you put it in your words, it was very clear.

parallelis · June 28, 2010, 2:10am

Yep, but it is different on Fermi, so it is better to make no assumptions about the execution model, except that all 32 threads of a warp are executed together, possibly interleaved 2-way or 4-way. In any case, you had better assume that the 32 threads of a warp may be grouped in any way…
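
What can be relied on is which threads share a warp: the programming guide states that a block is split into warps of consecutive, increasing thread IDs, starting with thread 0. A small sketch of that grouping, again in plain Python with illustrative function names (in a real kernel you would read the built-in `warpSize` rather than hard-coding 32):

```python
WARP_SIZE = 32  # hard-coded here for illustration only


def linear_thread_id(tx, ty, tz, dim_x, dim_y):
    """Thread ID of thread (tx, ty, tz) in a block whose x and y sizes are dim_x, dim_y."""
    return tx + ty * dim_x + tz * dim_x * dim_y


def warp_and_lane(linear_tid):
    """Return (warp index within the block, lane index within the warp)."""
    return divmod(linear_tid, WARP_SIZE)


if __name__ == "__main__":
    # Thread (1, 2, 0) of a 16x16 block has thread ID 33.
    tid = linear_thread_id(1, 2, 0, 16, 16)
    print(tid, warp_and_lane(tid))
```

This only says which threads belong to the same warp. It says nothing about which core executes which lane or in what order, which is exactly the part that should not be assumed.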
