Performance gain by "killing" warps: can there be any?

Hi everyone,

I am currently trying to optimize a kernel, and since I'm still rather new to CUDA, I have a question about the scheduling behavior.
I'm in a situation where an increasing number of threads in the block become idle as their calculations finish, and I'm running more than enough blocks to keep the SMs busy. What I'm asking myself now is: will full warps of a block that have finished their calculations stop being scheduled, freeing up processor time for the other warps? Or does the scheduler work only on a block basis, meaning I would have to "kill" a whole block, not just a warp, to gain any performance?
Currently, I check only the cheapest conditions right before accessing memory, since anything more would only slow down the threads doing meaningful work while bringing no performance gain (as I understand it).

Thanks in advance,

The scheduler won’t waste any cycles at all on warps that have completed.

It also wastes no cycles on warps that are suspended while waiting for memory read results.

The inefficiency you should worry about is thread divergence, where only some of the threads of a warp are alive. That is indeed inefficient: a warp with only 1 active thread executes at the same speed as a warp with 32 live threads. This is true whether the inactive threads have completed (finished the kernel) or are only temporarily suspended (during a divergent branch in code execution).
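To illustrate (a hypothetical kernel, not from this thread): a branch that splits threads *within* a warp forces the warp to execute both paths serially, while a branch that falls exactly on a warp boundary costs nothing extra.

```cuda
__global__ void divergence_demo(float *out)
{
    int tid = threadIdx.x;

    // Branch at a warp boundary (32 divides the split point):
    // each warp takes exactly one path, so there is no divergence.
    if (tid < 32) {
        out[tid] = 0.0f;        // path taken by warp 0 only
    }

    // Branch inside a warp: even and odd threads of the same warp
    // take different paths, so the warp runs both paths serially.
    if (tid % 2 == 0) {
        out[tid] += 1.0f;       // even lanes active, odd lanes idle
    } else {
        out[tid] += 2.0f;       // odd lanes active, even lanes idle
    }
}
```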

OK, so let me try to get this straight:

If I manage to get 32 threads that all belong to the same warp to finish their kernel, then the warp won't be scheduled again?

A stupid example, but beautifully straightforward ;)
the kernel is:
if (threadIdx.x > 31)
    do stuff;

Then the warp containing threads 0-31 will be scheduled once, fail the if-condition, and never be scheduled again?
And is there any way to apply this to a for-loop too? E.g. when some threads break out of the loop earlier than others.

Correct. Branches on warp boundaries have no performance penalty. This is sometimes used to implement a "fat kernel", which actually performs several completely different activities in a single launch.
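A minimal sketch of the "fat kernel" idea (hypothetical; the tasks are just placeholders): as long as the split between activities falls on a warp boundary, each warp runs exactly one branch and pays no divergence penalty.

```cuda
__global__ void fat_kernel(float *a, float *b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads 0-127 of each block do one job, the rest do another.
    // 128 is a multiple of the warp size (32), so no warp straddles
    // the branch: every warp executes exactly one of the two paths.
    if (threadIdx.x < 128) {
        if (tid < n) a[tid] *= 2.0f;   // first activity
    } else {
        if (tid < n) b[tid] += 1.0f;   // second, unrelated activity
    }
}
```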

Maybe a little bit off topic, but what happens in the following case:

   for (unsigned i = 0; i < 32; i++) {
	  if (threadIdx.x <= i) goto end;
	  // do some work
   }
end:
   // finish up
Will this be treated as 32 "separate" code paths that need to be serialized individually, or as just two divergent code paths, "waiting at label end" and "still in the for loop"?


The latter, as far as I know: on each iteration of the for loop, one more thread becomes inactive, but the warp still has only two paths.
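For what it's worth, the same control flow can be written without the goto (a sketch with equivalent behavior): threads that leave the loop simply wait at the reconvergence point after it, so at any moment the warp has at most two states, "still in the loop" and "past the loop".

```cuda
__global__ void early_exit_demo(float *out)
{
    for (unsigned i = 0; i < 32; i++) {
        if (threadIdx.x <= i)
            break;      // this thread leaves the loop and waits

        // do some work; one more thread drops out per iteration
    }

    // finish up: all threads of the warp reconverge here
    out[threadIdx.x] = 0.0f;
}
```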