I think you might have some confusion about how scheduling works on CUDA. Blocks (potentially more than 1 if resources allow) are assigned to each multiprocessor and run to completion. New blocks can only be scheduled to a multiprocessor once one of the existing blocks finishes. There is no preemptive multitasking of blocks.
Within a multiprocessor, the warp scheduler cycles through all the available warps (regardless of which block they come from) and selects one to run its next instruction. If the warp is waiting for a memory transaction to complete, the warp scheduler will ignore it. As long as you have enough warps available for the scheduler to pick from, there will always be something for the CUDA cores to do, even if some warps are stalled waiting for memory reads. This is what the PTX manual is referring to. In general, you want many threads on each multiprocessor, either through large blocks or many smaller blocks, to keep the CUDA core pipelines fed.
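To make the "large blocks vs. many small blocks" trade-off concrete, here is a rough back-of-the-envelope sketch of how many resident warps the scheduler can pick from for a given block size. The per-SM limits below are assumptions for illustration (Fermi-class figures: 1536 threads and 8 blocks per SM); the real limits for your device come from cudaGetDeviceProperties, and register/shared-memory usage can lower them further.

```python
# Illustrative arithmetic only -- not an official occupancy calculator.
# Assumed per-SM limits (Fermi-class; check cudaGetDeviceProperties for yours):
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536
MAX_BLOCKS_PER_SM = 8

def resident_warps(block_size):
    """Warps per SM the scheduler can choose from, ignoring register/smem limits."""
    # Resident blocks are capped both by the thread count and the block-count limit.
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // block_size)
    return blocks * (block_size // WARP_SIZE)

print(resident_warps(256))  # 6 blocks x 8 warps  -> 48 warps
print(resident_warps(32))   # capped at 8 blocks  -> 8 warps
```

Note how warp-sized blocks run into the blocks-per-SM cap long before the thread limit, leaving far fewer warps to hide memory latency with, which is one reason small blocks can underutilize the CUDA cores.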
Hi seibert, thanks a lot for the really quick reply!
Yes, my idea is to have multiple thread blocks assigned to each multiprocessor (many small warp-sized blocks, to be precise). You mentioned “If the warp is waiting for a memory transaction to complete, the warp scheduler will ignore it…” Does that mean that if a particular warp (== block) executes a memory operation, the scheduler will switch to a different warp (== block)? If that is the case, then I should be able to accomplish this, shouldn’t I?
Clarification: If I understand CUDA scheduling correctly, multiple thread blocks can reside inside the same multiprocessor while the scheduler switches between different warps. My intention is to force this switch in some controllable fashion (as stated above).
It seems I haven’t clearly understood the difference between thread blocks that are “resident” on an SM and those that are assigned to the same SM but not yet resident. So the question should be about switching between two resident (warp-sized) thread blocks. Sorry for the confusion.