I have a question about how branches are handled in CUDA. According to the CUDA Programming Guide, branches are either predicated or serialized, and divergent branches should occur only between warps, not within one; otherwise they hurt performance badly. My question: is this decided by branch resolution at runtime, or is it only analyzed at compile time? If a divergent branch occurs among the threads within a warp, what exactly happens? Is execution totally serialized? If so, are memory accesses that would otherwise be coalesced still coalesced? Thanks!
As far as I know, the hardware does not do branch predication.
If you have a two-way if-then-else, say
[codebox]if ( predicate ) then
    statement 1
else
    statement 2
end if
statement 4[/codebox]
step 1: all threads in a warp (32 threads) execute the “predicate function” to determine which way each thread should go.
Suppose threads 0-15 have predicate 1 and threads 16-31 have predicate 0. Then:
step 2: threads 0-15 execute statement 1
step 3: threads 16-31 execute statement 2
step 4: threads 0-31 execute statement 4
Steps 2 and 3 are serialized, so memory accesses can be coalesced only within statement 1 or within statement 2, not across the two.
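To make the steps above concrete, here is a minimal host-side sketch in plain C++ that models one warp with a 32-bit active mask. It is only a toy model of the description in this post, not how the hardware is actually implemented. The names `Pass` and `run_if_else` are made up for illustration, and the order in which the two paths are issued is an arbitrary choice of the model.

```cpp
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;

// Hypothetical record of one serialized pass: which statement was issued
// and which lanes (threads of the warp) were active while it ran.
struct Pass {
    int statement;
    std::uint32_t active_mask;
};

// Toy model of a divergent if/else inside one warp.
// Step 1: every lane evaluates the predicate.
// Steps 2-3: each path that at least one lane takes is issued once,
//            with only the lanes on that path active.
// Step 4: the warp reconverges and all lanes run the next statement.
template <typename Pred>
std::vector<Pass> run_if_else(Pred predicate) {
    std::uint32_t taken = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (predicate(lane)) {
            taken |= (1u << lane);
        }
    }
    const std::uint32_t all = 0xFFFFFFFFu;
    const std::uint32_t not_taken = ~taken & all;

    std::vector<Pass> passes;
    if (taken != 0) {
        passes.push_back({1, taken});      // statement 1, "then" lanes only
    }
    if (not_taken != 0) {
        passes.push_back({2, not_taken});  // statement 2, "else" lanes only
    }
    passes.push_back({4, all});            // statement 4, whole warp again
    return passes;
}
```

Calling `run_if_else([](int lane) { return lane < 16; })` gives three passes in this model, with masks covering lanes 0-15, lanes 16-31, and then the full warp. If every lane agrees on the predicate, only two passes are issued, which is the non-divergent case.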
I don’t believe that is an accurate description of how branching works. All threads will execute both statements 1 & 2, but the results of the execution are masked out, depending on the state of the predicate evaluation. And in current hardware statement 2 is executed before statement 1.
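For comparison, the predicated view described in this reply can be sketched the same way in plain C++ on the host. Again this is only a toy model under stated assumptions: `run_predicated` is a made-up name, the predicate `lane < 16` is taken from the example earlier in the thread, and `x * 2` and `x + 1` are arbitrary stand-ins for statements 1 and 2.

```cpp
#include <array>

constexpr int kWarpSize = 32;

// Toy model of predication for one warp (hypothetical helper, not a CUDA API).
// Every lane computes the result of BOTH statements; the predicate only
// decides which result is kept, so there is no divergent control flow.
std::array<int, kWarpSize> run_predicated(const std::array<int, kWarpSize>& in) {
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const bool p = lane < 16;      // predicate from the thread's example
        const int r1 = in[lane] * 2;   // stand-in for statement 1
        const int r2 = in[lane] + 1;   // stand-in for statement 2
        out[lane] = p ? r1 : r2;       // the result of the untaken path is discarded
    }
    return out;
}
```

In this model the cost is always that of both statements, whatever the predicate values are, which fits the point made later in the thread that predication only pays off when the branch bodies are small.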
The CUDA Programming Guide v2.2 states explicitly, in its section on control flow instructions, that predication is used only under certain conditions (i.e., when the branch body is small enough). My question was mainly about serialization, so I think what you meant here is probably right. I have another question: for loops such as for or while, whose termination condition may be dynamic, how does serialization happen? Thanks!