I am currently trying to optimize a kernel and, since I'm still rather new to CUDA, I have a question about the scheduling behavior.
I'm in a situation where an increasing number of threads in the block become idle as their calculations finish, and I'm running more than enough blocks to keep the SMs busy. What I'm asking myself now is: will full warps of a block that are done with their calculations stop being scheduled, freeing processor time for the other warps? Or does the scheduler work only on a per-block basis, meaning I would have to "kill" a whole block, not just a warp, to gain any performance?
Currently I check only the cheapest conditions, right before accessing memory, since anything more would only slow down the threads doing meaningful work while bringing no performance gain (as I understand it).
The scheduler won’t waste any cycles at all on warps that have completed.
It also wastes no cycles on warps that are suspended while waiting for memory read results.
The inefficiency you should worry about is thread divergence, where only some of the threads of a warp are active. That is indeed inefficient: a warp with only 1 active thread executes at the same speed as a warp with 32 live threads. This holds whether the inactive threads have completed (finished the kernel) or are only temporarily suspended (during a divergent branch in the code).
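To make the contrast concrete, here is a minimal sketch (hypothetical kernels, not from your code) of the two cases: a branch that diverges within a warp, so the warp serially executes both paths, versus a branch that is uniform per warp, so finished warps simply stop being scheduled at no cost.

```cuda
// Hypothetical sketch: per-thread divergence vs. warp-uniform branching.

__global__ void divergent(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Divergent: even- and odd-numbered threads of the SAME warp take
    // different paths, so the warp executes both sides serially with
    // half its threads masked off each time.
    if (tid % 2 == 0)
        out[tid] = in[tid] * 2;
    else
        out[tid] = in[tid] + 1;
}

__global__ void warp_uniform(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Uniform per warp: all 32 threads of a warp compute the same
    // warp index, so the whole warp takes the same path. Warps whose
    // condition is false finish immediately and stop being scheduled.
    int warpId = tid / 32;
    if (warpId % 2 == 0)
        out[tid] = in[tid] * 2;
}
```

The practical takeaway is that conditioning work on something uniform across a warp (e.g. `tid / 32` rather than `tid % 2`) avoids the divergence penalty entirely.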
So if I manage to get all 32 threads that belong to the same warp to finish the kernel, the warp won't be scheduled again?
stupid example, but beautifully straightforward ;)
the kernel is:

```cuda
__global__ void kernel()
{
    if (threadIdx.x > 31) {
        // ... all the actual work ...
    }
}
```
then the warp containing threads 0-31 will be scheduled once, fail at the if-condition, and never be scheduled again?
And is there any way to apply this to a for-loop too? E.g., when some threads break out of the loop earlier than others.
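A sketch of that case (hypothetical kernel and names), assuming the usual SIMT behavior: a thread that breaks out of a loop early is merely masked off, and the warp keeps issuing iterations until every one of its 32 threads has left the loop.

```cuda
// Hypothetical sketch: early break inside a loop.
__global__ void early_break(const int *data, int *out, int limit)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;

    for (int i = 0; i < 1000; ++i) {
        sum += data[tid * 1000 + i];
        // A thread that breaks here is masked off for the remaining
        // iterations; it costs nothing extra, but the warp as a whole
        // keeps issuing iterations until ALL 32 of its threads have
        // exited the loop. Only then does the warp stop consuming
        // scheduler time for this loop.
        if (sum > limit)
            break;
    }
    out[tid] = sum;
}
```

So breaking early helps only once the slowest thread of the warp is also done; per-warp exit conditions (uniform across all 32 threads) are what actually free scheduler time early.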