Must all threads execute the same code? "Branch divergence occurs only within a warp"

There is one thing I do not understand: what is written in ‘ptx_isa_1.2.pdf’, page 9 (please have a look at the attached screenshot).
There, an nVidia author writes:

“A warp executes one common
instruction at a time, so full efficiency is realized when all threads of a warp agree on their
execution path. If threads of a warp diverge via a data-dependent conditional branch, the
warp serially executes each branch path taken, disabling threads that are not on that path,
and when all paths complete, the threads converge back to the same execution path.”

Does that mean when I don’t have a lot of threads execute the EXACT same code, the execution speed
will decrease by several dozen times?

I remember the CUDA example with the matrix multiplication. It’s logical that you must execute
the very same code for every piece of the matrix.
But I wanted to run something similar for a string search. The code is relatively complex (compared to the matrix multiplication).

I planned to take my existing search code and just change the loop borders, so that the first thread searches the first
5 percent of the saved n-grams, the second thread the 2nd 5 percent etc.
I did not think about what branches the code takes at ‘if’ statements, for example.
Is this a bad idea?

On a multicore CPU (‘C’, not ‘G’!) it runs #Core times faster.
Will it also run faster on a GPU or is this a problem because of this branch divergence?

Thanks in advance. (also thanks for your previous answers :) )

It means that if you have code like

if (threadIdx.x > 15)
    funcA();
else
    funcB();

half your threads will be disabled in funcA and half will be disabled in funcB (only in the first warp),
so that warp will run at half speed compared to full speed.
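To make the two cases concrete, here is a minimal sketch. Only the `threadIdx.x > 15` condition comes from the post above; the kernel names `divergent` and `uniform`, the placeholder bodies of `funcA`/`funcB`, and the host helper `run` are made up for this example, and it assumes a warp size of 32.

```cuda
#include <cuda_runtime.h>

// Placeholder work; the real funcA/funcB are not shown in the thread.
__device__ float funcA(float x) { return x * 3.0f; }
__device__ float funcB(float x) { return x + 1.0f; }

// In the first warp, threads 0..15 take funcB and threads 16..31 take funcA,
// so that warp executes both paths one after the other.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x > 15)
        data[i] = funcA(data[i]);
    else
        data[i] = funcB(data[i]);
}

// Here the condition is the same for every thread of a warp,
// so each warp follows exactly one path and no warp diverges.
__global__ void uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 1)
        data[i] = funcA(data[i]);
    else
        data[i] = funcB(data[i]);
}

// Hypothetical host helper: runs one block of 64 threads on 64 ones
// and copies the result back into out[0..63].
void run(bool useUniform, float *out)
{
    const int n = 64;
    float *d = 0;
    for (int i = 0; i < n; ++i) out[i] = 1.0f;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, out, n * sizeof(float), cudaMemcpyHostToDevice);
    if (useUniform)
        uniform<<<1, n>>>(d);
    else
        divergent<<<1, n>>>(d);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

Both kernels produce correct results; the difference is only in how many passes the first warp needs. Whether that shows up in wall clock time depends on the rest of the kernel.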

Not necessarily, no. If your algorithm is bounded by the memory bandwidth, then there are plenty of idle clocks left around to process divergent warps so the wall clock time cost is nil.

Ok, does that mean I shouldn’t think too much when I write my code?

just keep in mind that you should avoid divergence in warps when possible. (but don’t put much time/effort into that)
after your first version is running, analyze if you’re memory-bound.
if yes: put every effort into memory bandwidth optimization
if not: first look for expensive functions in your code and then look that you get rid of some divergence.
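One rough way to do the "are you memory-bound" check is to time the kernel with CUDA events and compare the effective bandwidth against the peak bandwidth of your card. The sketch below is only an illustration: `copyKernel` is a stand-in for your own kernel, `effectiveBandwidthGBs` is a made-up helper name, and the byte count (one read plus one write per element) has to be adapted to what your kernel really touches.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel: reads one float and writes one float per thread.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Returns the effective bandwidth in GB/s for one launch over n elements.
float effectiveBandwidthGBs(int n)
{
    float *in = 0, *out = 0;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    cudaEventRecord(start);
    copyKernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);

    // Bytes moved: one read plus one write of n floats.
    double bytes = 2.0 * n * sizeof(float);
    return (float)(bytes / (ms * 1.0e-3) / 1.0e9);
}
```

The first launch in a program includes some start-up overhead, so it is probably worth running the kernel once before timing it. If the number you get is close to the card's peak, divergence is unlikely to be your bottleneck.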

No. If you have a divergent if(), its execution speed will be at most 2x slower. If you have a divergent for(), execution will take as long as the slowest thread. Only under extreme scenarios (like a divergent if() nested in a divergent if(), five times over) will the slowdown be “several dozen times.”
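The divergent for() case can be sketched like this. The kernel `divergentLoop`, the `counts` array and the host helper `runDivergentLoop` are hypothetical names for illustration; in a search kernel the loop count would come from the data each thread works on.

```cuda
#include <cuda_runtime.h>

// Each thread runs its own number of iterations. Threads of a warp that
// finish early are disabled until the thread with the largest count is done.
__global__ void divergentLoop(const int *counts, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < counts[i]; ++k)
        acc += 1.0f;
    out[i] = acc;
}

// Hypothetical host helper: one warp of 32 threads, thread i loops i times.
// The results are copied into out[0..31].
void runDivergentLoop(float *out)
{
    const int n = 32;
    int hostCounts[n];
    for (int i = 0; i < n; ++i) hostCounts[i] = i;

    int   *dCounts = 0;
    float *dOut    = 0;
    cudaMalloc(&dCounts, n * sizeof(int));
    cudaMalloc(&dOut,    n * sizeof(float));
    cudaMemcpy(dCounts, hostCounts, n * sizeof(int), cudaMemcpyHostToDevice);

    divergentLoop<<<1, n>>>(dCounts, dOut);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    cudaFree(dOut);
}
```

In this example the warp stays busy for 31 iterations, the count of its slowest thread, even though thread 0 has nothing to do. If the per-thread counts within a warp are similar, little is lost.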

(This is in contrast to other sorts of conflicts, which are much more prone to causing full-on order-of-magnitude slowdowns.)