Occupancy question

I was working through a CUDA programming book, and the topic on occupancy says that something like the following should report about 50% occupancy because of warp divergence:

__global__ void kernelmath(float *d_C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if ((tid % 2) == 0)
        a = 1.0f;
    else
        b = 1.0f;
    d_C[tid] = a + b;   // store the result so a and b are not optimized away
}

But profiling the real example still reports 100%, and the book explains that this is due to "compiler optimization that replaces branch instructions with predicated instructions for short conditional code such as this". The book then rewrites the condition as:
bool ipred = (tid % 2 == 0);
if (ipred) {
    a = 1.0f;
} else {
    b = 1.0f;
}

and that version reports about 71.43% branch occupancy.

Now, the following should get 100% because there is no warp divergence: every thread within a warp takes the same path, even though the next warp may take the other one:
if ((tid / 32) % 2 == 0) {
    a = 1.0f;
} else {
    b = 1.0f;
}
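
For completeness, the fragments above need a host launcher before they can be compiled and profiled. A minimal sketch (just an assumption on my part: a 1-D launch of 64 blocks of 64 threads, using the kernelmath name from the first example) would be:

#include <cstdio>
#include <cuda_runtime.h>

// one of the kernel definitions above goes here (same .cu file), e.g. kernelmath

int main() {
    const int nBlocks = 64, nThreads = 64;      // assumed launch shape
    float *d_C = nullptr;
    cudaMalloc(&d_C, nBlocks * nThreads * sizeof(float));
    kernelmath<<<nBlocks, nThreads>>>(d_C);     // kernel being profiled
    cudaDeviceSynchronize();                    // wait so the profiler sees the launch
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_C);
    return 0;
}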

But that is from an older book, and its measurements were taken on a Tesla M2070, an old card.

I tried to do the same on an RTX 2070, a much newer card, but the nvprof commands from the book no longer work because nvprof does not support metric collection on devices of compute capability 7.5 and newer. In ncu-ui I see roughly the same ~60% for all three cases above (warp divergence, warp divergence with longer conditional code, no warp divergence).
For ncu, what I did was run "ncu -o profile", open the report in ncu-ui, and look at Achieved Occupancy.

Am I doing something wrong?

The term “occupancy” has had numerous meanings, so without the context of the book it is unclear what is meant by “occupancy” here. The most common use is warp occupancy, which is the number of active/resident warps on an SM divided by the maximum number of warps the SM supports.
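
If warp occupancy is what you are after, the CUDA occupancy API reports the theoretical value without a profiler. A rough sketch (assuming the kernelmath kernel and 64-thread blocks from your question; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelmath(float *d_C) {        // kernel from the question
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if ((tid % 2) == 0) a = 1.0f; else b = 1.0f;
    d_C[tid] = a + b;
}

int main() {
    int blockSize = 64;                          // assumed block size
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernelmath, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical warp occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}

Note this is the theoretical upper bound; the Achieved Occupancy value shown by Nsight Compute is measured at run time and is usually lower.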

The book may instead be using it to mean the active thread occupancy of an executed instruction, i.e. the active threads per instruction executed:

    smsp__thread_inst_executed / smsp__inst_executed
    (reported directly as smsp__thread_inst_executed_per_inst_executed)

or the active, predicated-on threads per instruction executed:

    smsp__thread_inst_executed_pred_on / smsp__inst_executed
    (reported directly as smsp__thread_inst_executed_pred_on_per_inst_executed)
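
Both ratios can also be collected from the command line with something like the following (./your_app stands for your executable; exact metric names and rollup suffixes such as .ratio vary between Nsight Compute versions, and ncu --query-metrics lists what your install supports):

    ncu --metrics smsp__thread_inst_executed_per_inst_executed.ratio,smsp__thread_inst_executed_pred_on_per_inst_executed.ratio ./your_app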

The compiler optimizes such simple kernels, which makes it very hard to learn about divergence and predication from them. The two best approaches I can offer are:

  1. Use the CUDA debugger (cuda-gdb or the MS Visual Studio CUDA debugger) and single-step the code in mixed mode (source + SASS). NOTE: building with -G adds a lot of extra instructions to maintain variable scope.
  2. Use the Nsight Compute Source View to look at the counters listed above at both the CUDA C++ and SASS level. Optimizing kernels often requires learning to read SASS, and reading SASS is not easy given the lack of documentation. PTX is also an option, but it is often even more challenging because PTXAS is an optimizing assembler.
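
To illustrate the point about compiler optimization, one way to make the divergence visible in the counters above (this is a sketch, not the book's code) is to give each branch enough work that predicating both paths is no longer attractive, for example:

__global__ void divergeKernel(float *d_C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if (tid % 2 == 0) {
        // enough work per path that executing both sides under predication is costly
        for (int i = 0; i < 100; ++i) a += sinf(a + tid);
    } else {
        for (int i = 0; i < 100; ++i) b += cosf(b + tid);
    }
    d_C[tid] = a + b;   // keep a and b live so the branches are not removed
}

Whether the compiler still predicates this depends on the architecture and compiler version, so verify by looking at the SASS.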