I’m a bit confused about what warp branching (or divergence) really is.
Does branching occur when threads in one warp do different things, e.g. because of conditions on the threadIdx? Or is it when whole warps do different things, e.g. because of conditions on the blockIdx?
I was also wondering whether it is possible (read: efficient) to let different blocks in one kernel do different things, which can be useful if the data being operated on is small. (This is essentially merging two kernels.)
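To make it concrete, something like this is what I have in mind (the kernel name and the pInput/pOutput pointers are just made-up placeholders):

__global__ void example(const float* pInput, float* pOutput)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Case 1: the condition depends on threadIdx.x, so threads
    // within the same warp can take different paths.
    if (threadIdx.x % 2 == 0)
        pOutput[i] = pInput[i] * 2.0f;
    else
        pOutput[i] = pInput[i] + 1.0f;

    // Case 2: the condition depends on blockIdx.x, so every thread
    // in a given warp (and block) takes the same path.
    if (blockIdx.x == 0)
        pOutput[i] += 1.0f;
}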
Warp divergence is most often encountered when you have branching, but at the thread level, i.e.:

float x = pInputData[threadIdx.x];
if (x > 0.0f)
    doThis();
else
    doThat();

This will work fine if all your data falls on one side of the condition or the other, but if it is, say, random data, the warp will almost always have to execute both functions in sequence.
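By contrast, if the condition is uniform across the warp - for example it depends only on blockIdx, or on the warp index within the block - the whole warp takes a single path and there is no divergence. A rough sketch, with doThis()/doThat() again standing in for whatever device functions you like:

int warpId = threadIdx.x / warpSize;   // same value for every thread in a warp

if (warpId == 0)        // uniform within the warp -> no divergence
    doThis();
else
    doThat();

if (blockIdx.x == 0)    // uniform within the whole block -> no divergence either
    doThis();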
“Warp divergence” specifically refers to threads within a warp taking different execution paths at a branch point. Entire warps and blocks can branch arbitrarily with no performance penalty, so having different blocks do completely different tasks should be no problem. The only downside to this approach (sometimes called a “fat kernel”) is that the kernel runs until the last block finishes, so generally you want each block to have roughly equal runtime.
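A minimal sketch of such a fat kernel, assuming two hypothetical tasks (the names and the placeholder work are mine, nothing special):

__device__ void taskA(const float* in, float* out)
{
    out[threadIdx.x] = in[threadIdx.x] * 2.0f;   // placeholder work for old kernel A
}

__device__ void taskB(const float* in, float* out)
{
    out[threadIdx.x] = in[threadIdx.x] + 1.0f;   // placeholder work for old kernel B
}

__global__ void fatKernel(const float* pInA, float* pOutA,
                          const float* pInB, float* pOutB)
{
    // Whole blocks take different paths, so no warp ever diverges here.
    if (blockIdx.x == 0)
        taskA(pInA, pOutA);   // block 0 does what used to be kernel A
    else
        taskB(pInB, pOutB);   // the remaining block(s) do what used to be kernel B
}

Launched as, e.g., fatKernel<<<2, 256>>>(dInA, dOutA, dInB, dOutB); with each array holding at least 256 floats.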
On Compute 1.x devices (which is the vast majority), it doesn’t matter much whether the blocks take the same time to execute - either way you’re going to hide the majority of the execution time of all but the longest-running block(s) (ignoring possible contention between blocks for hardware resources, which can have unanticipated side effects on performance).
I’ve seen some pretty impressive performance improvements doing this (up to an order of magnitude in some cases), simply because the workloads of the kernels were so small they couldn’t really spread across many blocks - or they had to run on a single block due to synchronization issues that are impossible to handle in multi-block kernels (common for deterministic algorithms with unknown output sizes) and the lack of concurrent kernel execution on 1.x devices :)
The major downside I see is the complication of merging 2+ single-block kernels into a single multi-block kernel: wasted smem, the instruction limit (there’s an upper bound on how many kernels you can merge into one before the kernel gets too large), cache thrashing of the TMUs / cmem, etc…
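To illustrate the smem point: the merged kernel has to declare the shared memory of the most demanding original kernel, so blocks running the other task drag the allocation around unused. A made-up example (the sizes and placeholder work are arbitrary; it assumes blockDim.x <= 1024):

__global__ void mergedKernel(const float* pInA, float* pOutA,
                             const float* pInB, float* pOutB)
{
    // Task A needs 4 KB of smem, task B needs none - but after the merge,
    // every block statically reserves the full 4 KB, which can hurt occupancy.
    __shared__ float scratch[1024];

    if (blockIdx.x == 0)
    {
        scratch[threadIdx.x] = pInA[threadIdx.x];
        __syncthreads();
        pOutA[threadIdx.x] = scratch[threadIdx.x] * 2.0f;
    }
    else
    {
        // scratch[] is allocated for these blocks too, but never touched
        pOutB[threadIdx.x] = pInB[threadIdx.x] + 1.0f;
    }
}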