Must all threads execute the same code? "Branch divergence occurs only within a warp"

There is one thing I do not understand: what is written in ‘ptx_isa_1.2.pdf’, page 9 (please have a look at the attached screenshot).
There an NVIDIA author writes:

“A warp executes one common
instruction at a time, so full efficiency is realized when all threads of a warp agree on their
execution path. If threads of a warp diverge via a data-dependent conditional branch, the
warp serially executes each branch path taken, disabling threads that are not on that path,
and when all paths complete, the threads converge back to the same execution path.”

Does that mean that when I don’t have a lot of threads executing the EXACT same code, the execution speed
will decrease by several dozen times?

I remember the CUDA example with the matrix multiplication. It is logical that you must execute
the very same code for every piece of the matrix.
But I wanted to run something similar for a string search. That code is relatively complex (compared to the matrix multiplication).

I planned to take my existing search code and just change the loop bounds, so that the first thread searches the first
5 percent of the stored n-grams, the second thread the second 5 percent, and so on.
I did not think about which branches the code takes at ‘if’ statements, for example.
Is this a bad idea?

On a multicore CPU (‘C’, not ‘G’!) it runs #cores times faster.
Will it also run faster on a GPU, or is that a problem because of this branch divergence?

Thanks in advance. (also thanks for your previous answers :) )
CUDAExecutionPathMustNotBeLeft.png

It means that if you have code like

if (threadIdx.x > 15)
    funcA();
else
    funcB();

half your threads will be disabled while funcA runs and the other half will be disabled while funcB runs (only in the first warp),
so that warp will run at half speed compared to full speed

Not necessarily, no. If your algorithm is bounded by the memory bandwidth, then there are plenty of idle clocks left around to process divergent warps so the wall clock time cost is nil.

Ok, does that mean I shouldn’t think too much when I write my code?

Just keep in mind that you should avoid divergence within warps when possible (but don’t put much time/effort into that at first).
After your first version is running, analyze whether you’re memory-bound.
If yes: put every effort into memory bandwidth optimization.
If not: first look for expensive functions in your code, and then try to get rid of some of the divergence.

No. If you have a divergent if(), its execution speed will be at most 2x slower. If you have a divergent for(), execution will take as long as the slowest thread. Only under extreme scenarios (like a divergent if() nested in a divergent if(), five times over) will the slowdown be “several dozen times.”

(This is in contrast to other sorts of conflicts, which are much more prone to causing full-on order-of-magnitude slowdowns.)