Warp branching

I’m a bit confused about what warp branching (or divergence) really is.

Does branching occur if threads within one warp do different things, e.g., because of conditions on the threadIdx? Or is it that warps do different things, e.g., because of conditions on the blockIdx?

I was also wondering whether it is possible (read: efficient) to let different blocks in one kernel do different things, which can be useful if the data operated on is small. (This is essentially merging two kernels.)

An example:

__global__ void test()
{
  ...

  if (blockIdx.x < 15) {
    dothis();
    ...
  }
  else {
    dosomethingelse();
    ...
  }
}

Suppose the GPU has 30 multiprocessors, so 30 blocks can be executed concurrently. How does the above code behave?

Do blocks 15-29 have to wait for blocks 0-14, or does all the code run in parallel?

If not, is this also an example of warp divergence?

Warp divergence is most often encountered when you have branching, but at the thread level.

i.e.

x = pInputData[threadIdx.x];
if (x > 0.0)
    doThis();   // taken by the threads whose value is positive
else
    doThat();   // taken by the rest of the warp

This works fine if all the threads in a warp take the same branch, but with, say, random data the warp will almost always have to execute both functions in sequence.
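
To make the warp granularity concrete, here is a small made-up contrast (just a sketch, assuming the usual warp size of 32; doThis/doThat are stand-ins):

__device__ void doThis() { /* ... */ }
__device__ void doThat() { /* ... */ }

__global__ void branchDemo()
{
    // No divergence: the condition is uniform within each warp
    // (threads 0-31 all take one path, threads 32-63 all take the other).
    if (threadIdx.x < 32)
        doThis();
    else
        doThat();

    // Divergence: even and odd lanes of the same warp take different paths,
    // so the warp executes doThis() and doThat() one after the other.
    if (threadIdx.x % 2 == 0)
        doThis();
    else
        doThat();
}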

Does this help?

/Henrik

“Warp divergence” specifically refers to threads within a warp taking different execution paths at a branch point. Entire warps and blocks can branch arbitrarily with no performance penalty, so having different blocks do completely different tasks should be no problem. The only downside to this approach (sometimes called a “fat kernel”) is that the kernel runs until the last block finishes, so generally you want each block to have roughly equal runtime.
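
For illustration, a minimal sketch of such a fat kernel (using the 15-block split from the question; taskA/taskB and the buffers are invented placeholders) could look like this:

__device__ void taskA(float *a, unsigned int i) { a[i] *= 2.0f; }
__device__ void taskB(float *b, unsigned int i) { b[i] += 1.0f; }

__global__ void fatKernel(float *a, float *b)
{
    unsigned int tid = threadIdx.x;

    if (blockIdx.x < 15) {
        // Blocks 0-14 do one job; every warp in these blocks takes the
        // same path, so nothing diverges.
        taskA(a, blockIdx.x * blockDim.x + tid);
    }
    else {
        // Blocks 15-29 do the other job.
        taskB(b, (blockIdx.x - 15) * blockDim.x + tid);
    }
}

// launched as, e.g.:  fatKernel<<<30, 256>>>(d_a, d_b);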

On compute capability 1.x devices (which are the vast majority), it doesn’t matter much whether the blocks take the same time to execute - either way you’re going to hide the majority of the execution time of all but the longest-running block(s) (ignoring possible contention between blocks for hardware resources, which can have unanticipated side effects on performance).

I’ve seen some pretty impressive performance improvements from doing this (up to an order of magnitude in some cases), simply because the kernels’ workloads were so small they couldn’t really be spread across many blocks - or they had to run in a single block due to synchronization issues that are impossible to handle in multi-block kernels (common for deterministic algorithms with unknown output sizes) and the lack of asynchronous kernel execution on 1.x devices :)

The major downside I see is the complication of merging 2+ single-block kernels into a single multi-block kernel, in terms of wasted smem, the kernel instruction limit (there’s an upper bound on how many kernels you can merge into one before you hit it), cache thrashing of the TMUs / cmem, etc.
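
As a rough sketch of the smem point (sizes and buffers invented): the merged kernel has to statically declare shared memory for the most demanding of the original kernels, so blocks running the lighter task carry that allocation around unused, which can hurt occupancy for the whole kernel.

__global__ void mergedKernel(const float *a, float *outA, float *outB)
{
    __shared__ float buf[1024];   // original kernel A needed 4 KB of smem

    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + tid;

    if (blockIdx.x < 15) {
        // Kernel A's half actually uses the shared buffer.
        buf[tid] = a[gid];
        __syncthreads();
        outA[gid] = buf[tid] * 2.0f;
    }
    else {
        // Kernel B's half never touches buf, yet each of its blocks still
        // reserves the full 4 KB (and every thread is allocated the register
        // count of the costlier path, as noted below).
        outB[(blockIdx.x - 15) * blockDim.x + tid] = 1.0f;
    }
}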

And each thread will have to use the maximum number of registers required to complete any code path.

So

if ()
    ...
else
    ...

results in divergence. And with only

if ()
    doThis();

will that also result in divergence?
