Occupancy question

I was working through a CUDA programming book, and the topic on occupancy says that something like the following should report about 50% occupancy because of warp divergence:

__global__ void kernelmath(float *d_C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if ((tid % 2) == 0)
        a = 1.0f;
    else
        b = 1.0f;
    d_C[tid] = a + b;   // store the result so a and b are not optimized away
}

But profiling the real example still reports 100%, and the book explains that this is due to "compiler optimization that replaces branch instructions with predicated instructions for short conditional code such as this". The book then rewrites the condition as:
bool ipred = (tid % 2 == 0);
if (ipred) {
    a = 1.0f;
} else {
    b = 1.0f;
}

and that version reports about 71.43% branch occupancy.

Now, the following should get 100% because there is no warp divergence: every thread within a warp takes the same path, even though the next warp may take the other one:
if ((tid / 32) % 2 == 0) {
    a = 1.0f;
} else {
    b = 1.0f;
}
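
For completeness, the fragments above need a host launcher before they can be compiled and profiled. A minimal sketch (just an assumption on my part: a 1-D launch of 64 blocks of 64 threads, using the kernelmath name from the first example) would be:

#include <cstdio>
#include <cuda_runtime.h>

// one of the kernel definitions above goes here (same .cu file), e.g. kernelmath

int main() {
    const int nBlocks = 64, nThreads = 64;      // assumed launch shape
    float *d_C = nullptr;
    cudaMalloc(&d_C, nBlocks * nThreads * sizeof(float));
    kernelmath<<<nBlocks, nThreads>>>(d_C);     // kernel being profiled
    cudaDeviceSynchronize();                    // wait so the profiler sees the launch
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_C);
    return 0;
}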

But that is from an older book, and its measurements were taken on a Tesla M2070, an old card.

I tried to do the same on an RTX 2070, a much newer card, but the nvprof commands from the book no longer work because nvprof does not support metric collection on devices of compute capability 7.5 and newer. In ncu-ui I see roughly the same ~60% for all three cases above (warp divergence, warp divergence with longer conditional code, no warp divergence).
For ncu, what I did was run "ncu -o profile", open the report in ncu-ui, and look at Achieved Occupancy.

Am I doing something wrong?

The term “occupancy” has had numerous meanings, so without the context of the book it is unclear what is meant by “occupancy” here. The most common use is warp occupancy, which is the number of active/resident warps on an SM divided by the maximum number of warps the SM supports.
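
If warp occupancy is what you are after, the CUDA occupancy API reports the theoretical value without a profiler. A rough sketch (assuming the kernelmath kernel and 64-thread blocks from your question; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelmath(float *d_C) {        // kernel from the question
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if ((tid % 2) == 0) a = 1.0f; else b = 1.0f;
    d_C[tid] = a + b;
}

int main() {
    int blockSize = 64;                          // assumed block size
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernelmath, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical warp occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}

Note this is the theoretical upper bound; the Achieved Occupancy value shown by Nsight Compute is measured at run time and is usually lower.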

The book may instead be using it to mean the active thread occupancy of an executed instruction, i.e. the active threads per instruction executed:

    smsp__thread_inst_executed / smsp__inst_executed
    (reported directly as smsp__thread_inst_executed_per_inst_executed)

or the active, predicated-on threads per instruction executed:

    smsp__thread_inst_executed_pred_on / smsp__inst_executed
    (reported directly as smsp__thread_inst_executed_pred_on_per_inst_executed)
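
Both ratios can also be collected from the command line with something like the following (./your_app stands for your executable; exact metric names and rollup suffixes such as .ratio vary between Nsight Compute versions, and ncu --query-metrics lists what your install supports):

    ncu --metrics smsp__thread_inst_executed_per_inst_executed.ratio,smsp__thread_inst_executed_pred_on_per_inst_executed.ratio ./your_app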

The compiler optimizes such simple kernels, which makes it very hard to learn about divergence and predication from them. The two best approaches I can offer are:

  1. Use the CUDA debugger (cuda-gdb or the MS Visual Studio CUDA debugger) and single-step the code in mixed mode (source + SASS). NOTE: building with -G adds a lot of extra instructions to maintain variable scope.
  2. Use the Nsight Compute Source View to look at the counters listed above at both the CUDA C++ and SASS level. Optimizing kernels often requires learning to read SASS, and reading SASS is not easy given the lack of documentation. PTX is also an option, but it is often even more challenging because PTXAS is an optimizing assembler.
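
To illustrate the point about compiler optimization, one way to make the divergence visible in the counters above (this is a sketch, not the book's code) is to give each branch enough work that predicating both paths is no longer attractive, for example:

__global__ void divergeKernel(float *d_C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if (tid % 2 == 0) {
        // enough work per path that executing both sides under predication is costly
        for (int i = 0; i < 100; ++i) a += sinf(a + tid);
    } else {
        for (int i = 0; i < 100; ++i) b += cosf(b + tid);
    }
    d_C[tid] = a + b;   // keep a and b live so the branches are not removed
}

Whether the compiler still predicates this depends on the architecture and compiler version, so verify by looking at the SASS.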