Warp thread Scheduling

Hello Everyone,

When a warp is issued to an SM, are all of its threads executed on one core (so that 8 warps can execute simultaneously on one SM), or are they divided among all the cores of that SM?

I guess they are divided among the cores. If so, can someone please explain the sequence in which the threads of a warp are issued to the cores?

(Assuming 8 cores/SM.) Is it like this:

warp 1
threads 0, 1, 2 … 7 on core 1 (one after the other), threads 8, 9, 10 … 15 on core 2, threads 16, 17 … 23 on core 3, threads 24 … 31 on core 4

warp 2
threads 0, 1, 2 … 7 on core 5, threads 8, 9, 10 … 15 on core 6, threads 16, 17 … 23 on core 7, threads 24 … 31 on core 8

This approach would ensure that two warps are executed simultaneously on one SM.

OR

All the threads of a single warp are distributed across all the cores of one SM. Each warp is divided into 4 parts:
threads 0, 8, 16, 24 on core 1 (one after the other), threads 1, 9, 17, 25 on core 2, and so on …

I guess it should be approach 2, but I am not sure. Can someone please help me with this?
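To make the two candidate layouts concrete, here is a small Python sketch of both. It is only a toy model of the two guesses in this post, not a description of real hardware. The function names are made up for illustration, and cores, warps, and time steps are numbered from 0 rather than from 1.

```python
# Toy model of the two layouts guessed in the question.
# Nothing here is taken from hardware documentation.
WARP_SIZE = 32
CORES_PER_SM = 8  # assumed, as in the question


def approach_1(warp, thread):
    """Approach 1: a warp occupies 4 cores, 8 consecutive threads per core.

    Returns (core, step): the core a thread lands on and the time step
    at which that core would run it. Two warps share the SM.
    """
    cores_per_warp = CORES_PER_SM // 2          # 4 cores for each warp
    threads_per_core = WARP_SIZE // cores_per_warp  # 8 threads per core
    core = (warp % 2) * cores_per_warp + thread // threads_per_core
    step = thread % threads_per_core
    return core, step


def approach_2(thread):
    """Approach 2: one warp is spread over all 8 cores, 4 threads per core."""
    core = thread % CORES_PER_SM
    step = thread // CORES_PER_SM
    return core, step


def threads_on_core(mapping, core):
    """List the threads of one warp that a mapping places on a given core."""
    return [t for t in range(WARP_SIZE) if mapping(t)[0] == core]


if __name__ == "__main__":
    print(threads_on_core(lambda t: approach_1(0, t), 0))
    print(threads_on_core(approach_2, 0))
```

In this model approach 1 needs 8 time steps to finish one instruction for two warps, while approach 2 needs 4 time steps to finish one instruction for a single warp.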

Thanks and Regards

Could you tell us why you need to know that?

As far as I understand, which thread runs where and when is undefined (which means “it depends, and there is no guarantee that it will run the same way on two different computers”). Take a look at the CUDA programming guide and search for “undefined”.

Most probably what you are trying to do can be solved in a different way.

I am not actually doing anything. I am just trying to understand the execution model.

I tried searching for “undefined” but could not find anything. Can you please tell me which version and which page you are referring to?

Searching Google for “Cuda programming guide undefined” returns version 2.0, where “undefined” is mentioned 10 times.

This is well defined in the programming guide, in section G.3.1 for compute 1.x and in a later section for 2.x:

In other words, the 8 cores of an MP (compute 1.x) execute the same instruction for the same warp 4 times, once for each quarter warp.
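As a rough illustration of that statement, the Python sketch below lists the four passes needed to push one warp instruction through 8 cores. Which threads make up each quarter warp is an assumption here (threads 8q to 8q+7 for quarter q); this thread only says that there are four quarters. The function name is hypothetical.

```python
WARP_SIZE = 32
CORES_PER_MP = 8  # compute 1.x, as stated above


def quarter_warp_schedule(warp_size=WARP_SIZE, cores=CORES_PER_MP):
    """Return one list of (core, thread) pairs per pass.

    ASSUMPTION: quarter q holds the contiguous threads 8q .. 8q+7.
    The number of passes (warp_size / cores) is the only part that
    follows directly from the statement above.
    """
    passes = warp_size // cores
    return [
        [(core, p * cores + core) for core in range(cores)]
        for p in range(passes)
    ]


if __name__ == "__main__":
    for p, assignment in enumerate(quarter_warp_schedule()):
        threads = [t for _, t in assignment]
        print(f"pass {p}: threads {threads[0]}..{threads[-1]} "
              f"on cores 0..{len(assignment) - 1}")
```

Under the contiguous-quarter assumption this is the same layout as approach 2 in the question: core 0 sees threads 0, 8, 16, 24 over the four passes.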

Thank you very much for your efforts!

I was able to find the term “undefined”, but none of the occurrences seemed to answer my question!

I probably need to read more carefully!

Thanks anyway!

Thanks, Anderson! That helped.

I had read the same thing earlier but somehow did not understand the concept. When you put it in your own words, it became very clear.

Yep, but it is different on Fermi, so it is better to make no assumptions about the execution model, except that all 32 threads of a warp are executed together, possibly interleaved 2-way or 4-way. In any case, you should assume that the 32 threads of a warp may be grouped in any way…
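What can be relied on is which threads belong to the same warp: the programming guide says warps are formed from consecutive, increasing thread IDs within a block. A minimal Python sketch of that bookkeeping follows, assuming a warp size of 32 and the guide's linear thread ID formula (x + y*Dx + z*Dx*Dy). The helper names are hypothetical, and nothing here says which core runs a given lane.

```python
WARP_SIZE = 32  # assumed warp size


def linear_thread_id(thread_idx, block_dim):
    """Flatten a 3-component thread index into the linear thread ID.

    thread_idx and block_dim are (x, y, z) tuples, mirroring
    threadIdx and blockDim in a kernel.
    """
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_and_lane(tid):
    """Return (warp index within the block, lane index within the warp)."""
    return tid // WARP_SIZE, tid % WARP_SIZE


if __name__ == "__main__":
    # Thread (5, 2, 0) in a 16 x 4 x 1 block.
    tid = linear_thread_id((5, 2, 0), (16, 4, 1))
    print(tid, warp_and_lane(tid))
```

Code that only depends on warp membership, and not on how the hardware spreads the lanes over cores, stays valid across compute 1.x and Fermi.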