I’ve read through the programming guide and the best practices guide, but couldn’t find the exact rules about warp reuse. It’s my understanding that while a warp is waiting on a sync, other threads can be processed in its place. My question is this: which threads are candidates? Is it only other threads within the same block, or can threads from a separate block be processed in that slot?
I hate to double-post, but it seems like this should be an easy question to me. Should I have posted in the programming forum? I did some more research, but nothing is perfectly clear on the subject.
The scheduler on the multiprocessor switches between all active warps, regardless of which blocks they come from. This is why you can hide memory latency either by having large blocks, or by having multiple smaller blocks active on the same multiprocessor.
However, once a block starts, it must run to completion, so the scheduler cannot swap other blocks onto the multiprocessor to cover for idle warps. The occupancy calculator spreadsheet can help you figure out how many simultaneous blocks can fit onto a multiprocessor for your kernel and its resource requirements.
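If you would rather get that number from code than from the spreadsheet, something like the following may work. It is only a sketch: it assumes a toolkit recent enough to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor in the runtime API, and scaleKernel and the block size of 128 are made-up placeholders, not anything from your design.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to query.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int blockSize = 128;  // assumed block size, pick your own

    // Ask the runtime how many blocks of this kernel can be resident
    // on one multiprocessor at the same time (no dynamic shared memory).
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, scaleKernel, blockSize, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Warps from all resident blocks are candidates for the scheduler.
    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    printf("resident blocks per multiprocessor: %d\n", blocksPerSM);
    printf("warps the scheduler can choose from: %d\n",
           blocksPerSM * warpsPerBlock);
    return 0;
}
```

The second number is the pool of warps available to cover for a stalled warp, whichever block they belong to.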
(Also, I think your question is appropriate for this forum. You probably didn’t get an answer because the post volume has grown quite a bit here, and many readers can’t read every post anymore. Sometimes you just get unlucky. :) )
Thanks for the reply. Can I take this to mean that a whole block must be activated at once – i.e. no starting just a few warps at a time as space becomes available? (In my design some warps would ‘terminate’ much sooner than others, but I suppose they still have to sync up with the rest of the block at the end. Is this still true even if they produce no output? I assume so.) From what you tell me I think my interpretations are right, but I just need to make sure, because otherwise I think I may be able to get much better performance.
Yes, blocks are scheduled onto multiprocessors in their entirety. Even if some threads or warps terminate early, another block is not scheduled in their place until the entire block terminates.
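To make the situation concrete, here is a rough sketch of the kind of kernel you describe, where one warp per block runs long and the rest finish almost immediately. Everything in it is invented for illustration: the name unevenWork, the iteration counts, the launch sizes, and the assumption of a warp size of 32.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread loops iters[i] times, so warps in the
// same block can finish at very different times.
__global__ void unevenWork(const int *iters, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < iters[i]; ++k) acc += 1.0f;
    out[i] = acc;
    // Threads with iters[i] == 0 get here almost at once, but the block
    // keeps its place on the multiprocessor until its slowest warp is done.
}

int main()
{
    const int blockSize = 128, numBlocks = 64;  // assumed launch shape
    const int n = blockSize * numBlocks;
    const int longIters = 100000;               // assumed "slow" workload

    // Only the first 32 threads of each block (one warp, assuming a warp
    // size of 32) get real work; every other warp exits early.
    std::vector<int> hIters(n, 0);
    for (int i = 0; i < n; ++i)
        if (i % blockSize < 32) hIters[i] = longIters;

    int *dIters = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIters, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIters, hIters.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    unevenWork<<<numBlocks, blockSize>>>(dIters, dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> hOut(n);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("slow thread result: %.0f, fast thread result: %.0f\n",
           hOut[0], hOut[32]);

    cudaFree(dIters);
    cudaFree(dOut);
    return 0;
}
```

With a layout like this, three of the four warps in each block sit finished while the first one is still running, and the next block cannot take their place. Warps from other blocks already resident on the same multiprocessor can still be scheduled in the meantime.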