Block context switch penalty?


The CUDA Programming Guide recommends having at least twice as many blocks as available multiprocessors in order for the multiprocessors to be able to switch to a second block if all the threads in the current block are waiting for memory transfers to complete. Just out of curiosity… Are context switches between blocks just as fast as switches between thread warps within a block?


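In fact, the overhead of a context switch is effectively zero: the registers and shared memory of every resident block stay allocated on the SM, so nothing has to be saved or restored when the scheduler switches to another warp.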

Suppose one SM can hold two active blocks of 512 threads each. Then there are 1024 threads resident on that SM. The hardware divides those 1024 threads into 32 warps of 32 threads each, and the scheduler issues from that set of warps in round-robin fashion, no matter which block a warp belongs to.
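
To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). The names `kWarpSize`, `warpsPerBlock` and `activeWarpsPerSM` are made up for illustration and are not part of any CUDA API; the warp size of 32 is an assumption that holds on current NVIDIA GPUs.

```cpp
#include <cassert>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Hypothetical helper: number of warps one block occupies.
// A partially filled warp still takes a whole warp slot, so round up.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Hypothetical helper: warps resident on one SM, given how many
// blocks of a certain size are active on it at the same time.
int activeWarpsPerSM(int activeBlocksPerSM, int threadsPerBlock) {
    assert(activeBlocksPerSM >= 0);
    return activeBlocksPerSM * warpsPerBlock(threadsPerBlock);
}
```

With the numbers above, `activeWarpsPerSM(2, 512)` gives 32, which matches the 1024 / 32 count in the post.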

If you have more active warps on an SM, the memory latency of stalled warps can be overlapped with arithmetic from warps that are ready to run. This helps when your kernel is memory-bound.

So what matters is how many active warps you have per SM, not how many blocks per SM.

I drew a Gantt chart in the thread…rt=#entry600637

Have a look at it and you may see how arithmetic operations can be overlapped with memory latency.

Hi, thanks for your comments. I think I read somewhere, though, that it is sometimes an advantage to also have multiple active blocks per SM. For example, having 512 active threads in two active blocks of 256 might be better than having 512 threads in a single block. The reason is that the SM can run warps from the other block while all threads in one block are waiting at a __syncthreads() barrier, since that barrier only applies to threads within the same block.
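
A rough host-side sketch in plain C++ of what I mean (no GPU needed). It only counts warps and does not model the real scheduler. `LaunchShape`, `totalWarps` and `warpsOutsideOneBarrier` are made-up names for illustration, and the warp size of 32 is an assumption.

```cpp
#include <cassert>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Hypothetical description of what is resident on one SM.
struct LaunchShape {
    int activeBlocksPerSM;
    int threadsPerBlock;
};

// Warps occupied by a single block (rounded up to whole warps).
int warpsPerBlock(const LaunchShape& s) {
    assert(s.threadsPerBlock > 0);
    return (s.threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// All warps resident on the SM.
int totalWarps(const LaunchShape& s) {
    assert(s.activeBlocksPerSM >= 1);
    return s.activeBlocksPerSM * warpsPerBlock(s);
}

// Warps that belong to other blocks, and so are never held up by a
// __syncthreads() barrier inside one given block.
int warpsOutsideOneBarrier(const LaunchShape& s) {
    return totalWarps(s) - warpsPerBlock(s);
}
```

Both `LaunchShape{1, 512}` and `LaunchShape{2, 256}` give 16 warps in total, but `warpsOutsideOneBarrier` is 0 for the single block and 8 for the two smaller blocks, so in the second case there are still warps that could be scheduled while one block sits at its barrier.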

That’s an interesting thread, thanks!