The documentation about branch divergence isn't terribly clear.
It says that CUDA hardware will execute divergent branches serially.
But what is meant by this? It can be interpreted in two different ways:
if (A < B)
    C = 1;
else
    C = 2;
This is a branch, so the threads might "diverge". The documentation says it will be executed serially?!
But what is meant by this? Some possibilities:
- Each thread will execute serially, one after the other, thus 32 serial operations:
Thread 1 executes the branch, after which
Thread 2 executes the branch, after which
Thread 3 executes the branch, after which
… and so forth, until …
Thread 32 executes the branch.
So the number of cycles needed to execute the branch goes from 1 to 32 cycles.
(This assumes one branch costs one cycle; in reality it's probably a comparison + a jump + an assignment.)
However, perhaps the documentation meant something different:
All threads execute the serial list of instructions in lockstep, which means the serial list of instructions is executed fully, but in parallel. This is a weird concept to describe with just words.
Another way to put it: all the (serial) instructions that the branch consists of are executed in parallel, step by step, one instruction across all threads at once, followed by the next one, and so forth.
This leads to a different concept:
Thread 1…32 execute the comparison in parallel, after which
Thread 1…32 execute (and/or jump to) target 1 of the branch (C = 1) in parallel, after which
Thread 1…32 execute (and/or jump to) target 2 of the branch (C = 2) in parallel.
During the second and third phase, some executions might not take place because of a mask or predicate, or perhaps some results are simply thrown out. Though I have seen other documentation mention the utilization of execution units?!
Now assuming the second example is how the GPU actually works when executing these 32-thread warps, and the utilization of the ALU, FADD units and such is only 50%: can the GPU use the other 50% to process another warp? Or is the 50% that was not used simply wasted?