I’m a bit confused about what warp branching (or divergence) really is.
Does branching occur when threads in one warp do different things, e.g. because of conditions on the threadIdx? Or is it when whole warps do different things, e.g. because of conditions on the blockIdx?
I was also wondering whether it is possible (read: efficient) to let different blocks in one kernel do different things, which can be useful if the data being operated on is small. (This is essentially merging two kernels.)
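To make it concrete, something like this is what I have in mind (the kernel name and the pInput/pOutput pointers are just made-up placeholders):

__global__ void example(const float* pInput, float* pOutput)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Case 1: the condition depends on threadIdx.x, so threads
    // within the same warp can take different paths.
    if (threadIdx.x % 2 == 0)
        pOutput[i] = pInput[i] * 2.0f;
    else
        pOutput[i] = pInput[i] + 1.0f;

    // Case 2: the condition depends on blockIdx.x, so every thread
    // in a given warp (and block) takes the same path.
    if (blockIdx.x == 0)
        pOutput[i] += 1.0f;
}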
Warp divergence is most often encountered when you have branching, but at the thread level, i.e.:

float x = pInputData[threadIdx.x];
if (x > 0.0f)
    doThis();
else
    doThat();

This will work fine if all your data falls on one side of the condition or the other, but if it is, say, random data, the warp will almost always have to execute both functions in sequence.
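By contrast, if the condition is uniform across the warp - for example it depends only on blockIdx, or on the warp index within the block - the whole warp takes a single path and there is no divergence. A rough sketch, with doThis()/doThat() again standing in for whatever device functions you like:

int warpId = threadIdx.x / warpSize;   // same value for every thread in a warp

if (warpId == 0)        // uniform within the warp -> no divergence
    doThis();
else
    doThat();

if (blockIdx.x == 0)    // uniform within the whole block -> no divergence either
    doThis();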
“Warp divergence” specifically refers to threads within a warp taking different execution paths at a branch point. Entire warps and blocks can branch arbitrarily with no performance penalty, so having different blocks do completely different tasks should be no problem. The only downside to this approach (sometimes called a “fat kernel”) is that the kernel runs until the last block finishes, so generally you want each block to have roughly equal runtime.
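A minimal sketch of such a fat kernel, assuming two hypothetical tasks (the names and the placeholder work are mine, nothing special):

__device__ void taskA(const float* in, float* out)
{
    out[threadIdx.x] = in[threadIdx.x] * 2.0f;   // placeholder work for old kernel A
}

__device__ void taskB(const float* in, float* out)
{
    out[threadIdx.x] = in[threadIdx.x] + 1.0f;   // placeholder work for old kernel B
}

__global__ void fatKernel(const float* pInA, float* pOutA,
                          const float* pInB, float* pOutB)
{
    // Whole blocks take different paths, so no warp ever diverges here.
    if (blockIdx.x == 0)
        taskA(pInA, pOutA);   // block 0 does what used to be kernel A
    else
        taskB(pInB, pOutB);   // the remaining block(s) do what used to be kernel B
}

Launched as, e.g., fatKernel<<<2, 256>>>(dInA, dOutA, dInB, dOutB); with each array holding at least 256 floats.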
On Compute 1.x devices (which is the vast majority), it doesn’t matter much whether the blocks take the same time to execute - either way you’re going to hide the majority of the execution time of all but the longest-running block(s) (ignoring possible contention between blocks for hardware resources, which can have unanticipated side effects on performance).
I’ve seen some pretty impressive performance improvements doing this (up to an order of magnitude in some cases), simply because the workloads of the kernels were so small they couldn’t really spread across many blocks - or they had to run on a single block due to synchronization issues that are impossible to handle in multi-block kernels (common for deterministic algorithms with unknown output sizes) and the lack of concurrent kernel execution on 1.x devices :)
The major downside I see is the complication of merging 2+ single-block kernels into a single multi-block kernel: wasted smem, the instruction limit (there’s an upper bound on how many kernels you can merge into one before the kernel gets too large), cache thrashing of the TMUs / cmem, etc…
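To illustrate the smem point: the merged kernel has to declare the shared memory of the most demanding original kernel, so blocks running the other task drag the allocation around unused. A made-up example (the sizes and placeholder work are arbitrary; it assumes blockDim.x <= 1024):

__global__ void mergedKernel(const float* pInA, float* pOutA,
                             const float* pInB, float* pOutB)
{
    // Task A needs 4 KB of smem, task B needs none - but after the merge,
    // every block statically reserves the full 4 KB, which can hurt occupancy.
    __shared__ float scratch[1024];

    if (blockIdx.x == 0)
    {
        scratch[threadIdx.x] = pInA[threadIdx.x];
        __syncthreads();
        pOutA[threadIdx.x] = scratch[threadIdx.x] * 2.0f;
    }
    else
    {
        // scratch[] is allocated for these blocks too, but never touched
        pOutB[threadIdx.x] = pInB[threadIdx.x] + 1.0f;
    }
}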