The CUDA Programming Guide recommends launching at least twice as many blocks as there are multiprocessors, so that each multiprocessor can switch to a second block when all the threads in the current block are waiting for memory transfers to complete. Just out of curiosity: are context switches between blocks just as fast as switches between warps within a block?
Hi, thanks for your comments. I think I read somewhere, though, that it is sometimes an advantage to have multiple active blocks per SM as well. For example, having 512 active threads split across two active blocks of 256 might be better than having all 512 threads in a single block. The reason is that when all threads in one block are waiting at a __syncthreads() barrier, the SM can switch to the other block, whose threads do not depend on that barrier, since __syncthreads() only synchronizes threads within the same block.
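To make the 256 vs. 512 comparison concrete, here is a rough host-side sketch of the arithmetic. It is not from the Programming Guide: the SmLimits struct, both function names, and the limit values used below are hypothetical, and the estimate looks only at thread and block-count limits, ignoring registers and shared memory, which also restrict how many blocks fit on an SM.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits. Real values depend on the GPU's
// compute capability and must be looked up for the target device.
struct SmLimits {
    int max_threads_per_sm;    // resident threads per SM
    int max_blocks_per_sm;     // resident blocks per SM
    int max_threads_per_block; // largest legal block size
};

// Blocks that can be resident on one SM at the same time, considering
// only the thread and block-count limits (registers and shared memory
// are deliberately ignored in this sketch).
int resident_blocks_per_sm(const SmLimits& sm, int threads_per_block) {
    if (threads_per_block <= 0 || threads_per_block > sm.max_threads_per_block) {
        return 0; // block size is not launchable under these limits
    }
    int by_threads = sm.max_threads_per_sm / threads_per_block;
    return std::min(by_threads, sm.max_blocks_per_sm);
}

// Warps on the SM that belong to other blocks, and so are unaffected
// if every warp of one resident block is stalled at __syncthreads().
int warps_outside_one_block(const SmLimits& sm, int threads_per_block,
                            int warp_size = 32) {
    int blocks = resident_blocks_per_sm(sm, threads_per_block);
    if (blocks == 0) {
        return 0;
    }
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    return (blocks - 1) * warps_per_block;
}
```

Assuming an SM limited to 512 resident threads and 8 resident blocks (SmLimits sm{512, 8, 512}), blocks of 256 threads give two resident blocks, so 8 warps remain unaffected while the other block sits at its barrier. A single block of 512 threads leaves none. On a real device the CUDA runtime's occupancy functions would give a more faithful answer, since they also account for register and shared memory usage.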