Thread and Instruction Scheduling


I am trying to understand the execution model and the instruction and thread scheduling in CUDA, as well as the overlapping of computation with data movement from global to shared memory. Within a warp of size 32, we have 4 sets of 8 threads. How is execution cycled through these 4 sets? Is there periodic time-slicing in a round-robin manner, and if so, what is the granularity of the time slice? Also, is a single program counter maintained per warp? Across warps and thread blocks, how does execution progress, and what is the granularity of the time slice there?

Thanks in advance,


Thread blocks, as well as the warps within them, are scheduled by the run-time scheduler, which bases its decisions on a number of factors. The warp and block order is not guaranteed (as mentioned in the Programming Guide), so your code shouldn't rely on any particular warp/block order. If that's a problem, I'd like to hear more details.

If there are warps available, the scheduler will swap them in if some warp blocks due to a move between global and shared memories. That’s how multiple warps are used to hide the latency of reading global memory.
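To illustrate why extra resident warps hide global-memory latency, here is a minimal Python sketch of an abstract scheduler. This is a toy model, not the actual hardware algorithm: the cycle counts (`compute_cycles=4`, `mem_latency=100`, `loads_per_warp=2`) and the pick-the-first-ready-warp policy are made-up assumptions chosen only to show the effect.

```python
# Toy model of latency hiding: NOT the real hardware scheduler, just an
# illustration of why more resident warps keep the multiprocessor busy.

def simulate(num_warps, compute_cycles=4, mem_latency=100, loads_per_warp=2):
    """Return (total_cycles, busy_cycles) for one round of execution.

    Each warp issues `compute_cycles` of arithmetic, then a global-memory
    load that stalls it for `mem_latency` cycles, repeated `loads_per_warp`
    times. The scheduler runs any ready warp while the others wait.
    (The final memory stall is not counted; this is only a sketch.)
    """
    ready_at = [0] * num_warps              # cycle at which each warp is ready
    loads_left = [loads_per_warp] * num_warps
    clock = 0
    busy = 0
    while any(loads_left):
        # Pick the lowest-indexed ready warp with work remaining
        # (a stand-in for whatever policy the hardware actually uses).
        candidates = [w for w in range(num_warps)
                      if loads_left[w] and ready_at[w] <= clock]
        if not candidates:
            # Everyone is stalled on memory: fast-forward to the next ready warp.
            clock = min(ready_at[w] for w in range(num_warps) if loads_left[w])
            continue
        w = candidates[0]
        clock += compute_cycles             # warp computes; pipeline is busy
        busy += compute_cycles
        ready_at[w] = clock + mem_latency   # warp now stalls on its load
        loads_left[w] -= 1
    return clock, busy

for n in (1, 8, 32):
    total, busy = simulate(n)
    print(f"{n:2d} warps: {total:4d} cycles, utilization {busy / total:.0%}")
```

With a single warp the multiprocessor idles through every memory stall; with enough warps resident, there is almost always a ready warp to swap in, and utilization approaches 100%.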


Hi Paulius,

Thanks for your response. Though your post answers these, I just wanted to reconfirm the following:

  1. Is there any time-slicing between warps, or does a warp occupy the multiprocessor until it blocks on a memory access, at which point it is immediately swapped out if another warp is active and ready to be scheduled?

  2. Is there a program counter per warp, or per set of 8 threads within a warp?

My code doesn't rely on a particular execution order. I just wanted to understand the scheduling in order to understand the performance characteristics.



I’m not familiar with the scheduler internals. I believe the latter is true (a warp runs until it blocks), but I’m not sure whether time-slicing is also applied.

Per warp, as the warp is the execution unit.
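One practical consequence of a single program counter per warp is branch divergence: when threads in a warp take different sides of a branch, the warp issues both sides in sequence, with the threads on the other side masked off. Here is a rough Python sketch of that lockstep model; the instruction counts (3 and 5) and the even/odd condition are invented for illustration, and this abstracts away the real SIMT reconvergence mechanism.

```python
# Toy lockstep model: one program counter per warp means a divergent branch
# serializes both sides, with inactive threads masked off during each side.

WARP_SIZE = 32

def run_warp(thread_ids):
    """Execute a kernel with a data-dependent branch, counting issued steps."""
    steps = 0
    results = {}

    # Branch condition: even threads take the 'then' side, odd the 'else' side.
    then_mask = [t for t in thread_ids if t % 2 == 0]
    else_mask = [t for t in thread_ids if t % 2 == 1]

    # 'then' side: 3 instructions, issued once for the whole warp;
    # odd threads are simply inactive during these steps.
    if then_mask:
        steps += 3
        for t in then_mask:
            results[t] = t * 2
    # 'else' side: 5 instructions, issued only after the 'then' side completes.
    if else_mask:
        steps += 5
        for t in else_mask:
            results[t] = t + 1

    return steps, results

steps, results = run_warp(range(WARP_SIZE))
print(f"divergent warp issued {steps} steps")    # 3 + 5: both sides serialized

uniform_steps, _ = run_warp(range(0, WARP_SIZE, 2))  # all even: no divergence
print(f"uniform warp issued {uniform_steps} steps")  # only the 'then' side
```

This is also why divergence costs nothing across warps: each warp has its own program counter, so different warps can freely take different paths.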