I’ll assume these are normalized results you’re talking about (4 branches, 1 divergent).
Thread divergence on 1.x devices is determined at the half-warp level, because all threads of a half-warp execute the same instruction at the same time (as to why this is the case, read up on it in the programming guide; it has to do with instruction scheduling frequency / clock frequency of the processors / etc). As such, the profiler (when it normalizes the counters) reports 1 divergent branch for every half-warp that hits one.
On 1.x devices a warp is 32 threads, thus a half-warp is 16 threads.
You’re executing 64 threads in this kernel, so 2 warps (4 half-warps).
Your branch is your if/else statement (if(threadIdx.x < 20) { … } else { … }). This is counted as a branch no matter what, even if all threads take the same path; a divergent branch is one where threads of the same half-warp take different paths. According to your code all threads end up evaluating this branch, so each of the 4 half-warps reports it once… this is where your ‘4 branches’ in the profiler comes from.
The 1 divergent branch is from the second half-warp (threads 16-31), where threads 16-19 will enter the ‘if(…)’ statement while threads 20-31 will enter your ‘else’ statement.
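To make the counting concrete, here is a small host-side C++ sketch (plain CPU code, not a kernel and not anything the profiler itself runs). It assumes the model described above: one branch event per half-warp that reaches the conditional, and one divergent event when the threads of that half-warp split. The names BranchCounts and count_branches are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative counter model; these names are hypothetical, not a CUDA API.
struct BranchCounts {
    int branches;   // one per half-warp that evaluates the conditional
    int divergent;  // half-warps whose threads split between 'if' and 'else'
};

// Models a 1D block of num_threads threads evaluating
// `if (threadIdx.x < threshold) { ... } else { ... }`.
// group_size = 16 is the half-warp size assumed for 1.x devices.
inline BranchCounts count_branches(int num_threads, int threshold,
                                   int group_size = 16) {
    assert(num_threads >= 0 && group_size > 0);
    BranchCounts c{0, 0};
    for (int first = 0; first < num_threads; first += group_size) {
        const int last = std::min(first + group_size, num_threads);  // exclusive
        int taken = 0;  // threads in this group that enter the 'if' side
        for (int tid = first; tid < last; ++tid) {
            if (tid < threshold) ++taken;
        }
        ++c.branches;  // every group that reaches the conditional counts once
        if (taken > 0 && taken < last - first) ++c.divergent;  // mixed paths
    }
    return c;
}
```

With your launch, count_branches(64, 20) gives 4 branches and 1 divergent, matching the profiler. Moving the threshold to a multiple of 16 (e.g. count_branches(64, 16)) gives 0 divergent in this model, since no half-warp is split.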
As I said before, this is a divergent branch because threads of a half-warp have to execute the same code. For a half-warp to execute a branch where its threads disagree, the instructions have to be serialized, s.t. each set of threads in the divergent branch takes turns executing its part of the branch, and then they resume together outside the branch once each set of threads has had its turn… (if that makes sense).
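If it helps, the "taking turns" part can be pictured with another host-side C++ sketch. It returns one active-thread mask per pass a half-warp needs to get through the if/else (bit i set means the i-th thread of the half-warp is active in that pass). This is only a model under the assumptions above: the function name serialized_passes is made up, and the order in which real hardware runs the two sides is assumed here, not documented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model, not a CUDA API: one 16-bit active mask per serialized
// pass for the half-warp starting at thread first_tid, evaluating
// `if (threadIdx.x < threshold) { ... } else { ... }`.
inline std::vector<std::uint16_t> serialized_passes(int first_tid, int threshold) {
    const int kHalfWarp = 16;  // half-warp size assumed for 1.x devices
    assert(first_tid >= 0 && first_tid % kHalfWarp == 0);

    std::uint16_t if_mask = 0;  // threads that take the 'if' side
    for (int lane = 0; lane < kHalfWarp; ++lane) {
        if (first_tid + lane < threshold) {
            if_mask |= static_cast<std::uint16_t>(1u << lane);
        }
    }
    // Everyone else in the half-warp takes the 'else' side.
    const std::uint16_t else_mask = static_cast<std::uint16_t>(~if_mask);

    std::vector<std::uint16_t> passes;
    if (if_mask != 0) passes.push_back(if_mask);      // assumed: 'if' side first
    if (else_mask != 0) passes.push_back(else_mask);  // then the 'else' side
    return passes;  // 1 pass = uniform, 2 passes = divergent (serialized)
}
```

For your kernel, serialized_passes(16, 20) yields two passes (0x000F for threads 16-19, then 0xFFF0 for threads 20-31), while the other three half-warps each yield a single pass with all 16 threads active.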
Your best bet to understand the visual profiler is to read the Help → “CUDA Visual Profiler Help” (F1) in the visual profiler (cudaprof/doc/cudaprof.html under your CUDA Toolkit directory).
Beyond that, read the CUDA Programming Guide and these forums (many people have asked similar questions, and NVIDIA and other people in the know have responded with some useful bits of info that aren’t documented anywhere else).