If I launch enough thread blocks to fully utilize my GPU, will a new thread block only be started once every warp or thread of a currently active block has finished? Or does the scheduling happen at a lower level, e.g. at warp level? (Also, is there any reference for this, e.g. in the Programming Guide, that I might have overlooked?)
Does the Independent Thread Scheduling introduced with Volta also have a bearing on this, or is it only relevant for intra-warp scheduling?
The reason I ask is that I have a kernel which runs faster with block size 16 than with 32, which was quite surprising to me, since a block of 16 is only half a warp, so presumably half the threads are doing nothing. The kernel currently stalls a lot because of many sub-optimal memory requests, so the extra scheduling flexibility of having more, smaller blocks might be a plausible explanation for this behavior. But to be sure, I need a better understanding of block scheduling. :)
In order for a new threadblock to be scheduled (i.e. deposited, so that its warps can be issued), sufficient resources on the SM must “free up”. A new threadblock will not be deposited until there are sufficient resources for the entire threadblock. The CWD or block scheduler does not deposit a “warp at a time” or any other granularity. There must be enough space for a block.
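As a side note, the per-block resource footprint that must "free up" (registers, shared memory, thread slots) is what the occupancy API reasons about. A minimal host-side sketch of querying how many whole blocks of a given size can be resident on one SM at once, using the real runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (the kernel here is just a placeholder):

```cuda
#include <cstdio>

// Placeholder kernel; occupancy depends on its register/shared-mem usage.
__global__ void dummyKernel() {}

int main()
{
    int numBlocks = 0;
    // How many blocks of 256 threads can be co-resident on one SM,
    // given this kernel's resource requirements? The answer is always
    // in units of whole blocks, matching the deposit granularity
    // described above.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, dummyKernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("Max resident blocks per SM at block size 256: %d\n", numBlocks);
    return 0;
}
```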
So how exactly do resources get released? At what granularity? Are the resources for a block released only when the entire block is finished, or at some other granularity? (e.g. as a warp retires, or as a thread retires, etc.) As far as I know this is not specified. However, it is possible to run a carefully designed experiment to show that in some cases, the resources associated with a block may be freed up “incrementally” as threads exit. In that case, it is possible for a new block to become “schedulable” as a result of threads exiting in a previous threadblock, even if not all of the threads have exited. This is/was true even prior to Volta, I believe. But I do believe the behavior varies somewhat by GPU architecture.
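For illustration, one way such an experiment might be structured (this is a hedged sketch, not a definitive test: `probe`, `blocksStarted`, and the placeholder `residentBlocks` value are my own names and assumptions; a real run would first measure the resident wave size, e.g. via the occupancy API times the SM count):

```cuda
#include <cstdio>

__device__ unsigned int blocksStarted = 0;  // blocks that have begun executing

__global__ void probe(unsigned int residentBlocks, long long timeout)
{
    // Every block announces itself as soon as it begins executing.
    if (threadIdx.x == 0)
        atomicAdd(&blocksStarted, 1);

    // All threads except thread 0 exit immediately. If the architecture
    // frees block resources incrementally as threads exit, this should
    // make room for additional blocks even while thread 0 is still busy.
    if (threadIdx.x != 0)
        return;

    // Thread 0 of each block spins until a block beyond the first
    // resident wave reports in, or until a timeout expires (the timeout
    // avoids a deadlock on architectures that free at block granularity).
    long long start = clock64();
    while (atomicAdd(&blocksStarted, 0) <= residentBlocks &&
           (clock64() - start) < timeout) { }
}

int main()
{
    // Assumption: 32 is a placeholder; measure the true resident wave
    // size for your GPU and kernel before drawing conclusions.
    unsigned int residentBlocks = 32;
    probe<<<2 * residentBlocks, 256>>>(residentBlocks, 1LL << 30);
    cudaDeviceSynchronize();

    unsigned int started = 0;
    cudaMemcpyFromSymbol(&started, blocksStarted, sizeof(started));
    printf("blocks started: %u of %u launched\n", started, 2 * residentBlocks);
    return 0;
}
```

If the spinning threads observe `blocksStarted` climb past the first wave before their timeout (i.e. the kernel finishes quickly rather than waiting out the timeout), that suggests resources were released as the non-spinning threads exited; if every spinner only exits via the timeout, the scheduler on that GPU is likely waiting for whole blocks.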
I don’t know if this actually has any bearing on your case, or not.
Thank you very much for the thorough clarification!
To conclude in my own words: with larger blocks that have a large load imbalance between warps within a block (e.g. due to memory stalls?), performance might actually suffer if the block's resources are freed only once every warp of the block has finished. But it might also be that the resources of some warps of a block get freed earlier, so that a new block can be started. Without a more sophisticated experiment, we cannot be sure which is the case. Right?
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.