Question about NVIDIA Visual Profiler's occupancy results

Hi!

I am trying to extract resource usage information for a CUDA application I have developed.

I am using nvvp (NVIDIA Visual Profiler).
Under Unguided Analysis -> Kernel Latency, there is resource and occupancy information that I am interested in.

However, the results are not that clear to me.

For example, here is a picture of nvvp's occupancy results:

I have the following questions:
1) What is the theoretical value, and why is it different from the achieved value?
2) Why do active blocks and active threads have only a theoretical value and no achieved value?
How can I find the achieved values for those?
3) Why is the achieved value for Active Warps 62.23? This is the actual number of active warps, right? How can it be 61.23 and not an integer?

Thank you

Try searching for “cuda profiler metrics”. The results you get most likely answer these questions plus a lot more that all of us will eventually have.

achieved occupancy = active_warps_sm / active_cycles_sm / MAX_WARPS_PER_SM * 100.0

Theoretical occupancy is calculated using the occupancy calculator.

Each cycle, active_warps_sm increments by a value between 0 and SM_COUNT * MAX_WARPS_PER_SM.
Each cycle, active_cycles_sm increments by a value between 0 and SM_COUNT.
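
As a rough illustration of the achieved occupancy formula above, here is a minimal Python sketch. The counter totals and the limit of 64 warps per SM are made-up values for the example, not numbers taken from the profiler output in the question; the real limit depends on the GPU architecture.

```python
def achieved_occupancy(active_warps_sm, active_cycles_sm, max_warps_per_sm):
    """Return achieved occupancy as a percentage, following the formula above."""
    if active_cycles_sm == 0:
        # No active cycles were counted, so there is nothing to average over.
        return 0.0
    return active_warps_sm / active_cycles_sm / max_warps_per_sm * 100.0


# Hypothetical counter totals for one kernel launch (assumed values).
active_warps_sm = 4_800_000
active_cycles_sm = 100_000
max_warps_per_sm = 64  # assumed hardware limit; varies by architecture

print(achieved_occupancy(active_warps_sm, active_cycles_sm, max_warps_per_sm))
```

With these assumed totals the average is 48 active warps per active cycle out of a possible 64.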

1)what is the theoretical value and why it is different from achieved?

The achieved value is measured on the hardware. The theoretical value is calculated based upon the launch information. Thread blocks and warps take cycles to launch so achieved < theoretical. If a warp exits early then it will not contribute for as many cycles.
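
To make the launch and early-exit effect concrete, here is a toy Python model of a single SM. The per-cycle warp counts are invented for illustration (real kernels run for far more cycles), and the limit of 64 warps per SM is an assumption.

```python
MAX_WARPS_PER_SM = 64    # assumed hardware limit
THEORETICAL_WARPS = 64   # assumed: the launch configuration can fill the SM

# Hypothetical number of resident warps in each active cycle:
# a ramp-up while blocks launch, a steady phase, then a tail as warps exit.
warps_per_cycle = [16, 32, 48] + [64] * 10 + [48, 32, 16]

active_warps = sum(warps_per_cycle)    # what active_warps_sm would accumulate
active_cycles = len(warps_per_cycle)   # what active_cycles_sm would accumulate

theoretical = THEORETICAL_WARPS / MAX_WARPS_PER_SM * 100.0
achieved = active_warps / active_cycles / MAX_WARPS_PER_SM * 100.0

print(f"theoretical: {theoretical:.2f}%")
print(f"achieved:    {achieved:.2f}%")
```

The cycles spent ramping up and draining pull the average below the theoretical limit, even though the SM is full during the steady phase.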

2)Why in active blocks and active threads there is not achieved value and there is only theoretical?
How is it possible to find the achieved value of those values?

The CUDA profilers do not measure active blocks or active threads. It could be useful to measure active blocks. Comparing the average thread block duration to the average warp latency for a kernel would indicate if there exists a tail effect.

The CUDA profilers do not measure active threads as this is not a very useful number. The profilers do measure the average number of threads executed per instruction executed and the average number of active predicated true threads executed per instruction executed. These metrics are easy to take action on. This information is presented per SASS instruction, per high-level source code, and per kernel in all three profilers (Nsight VSE, Nsight Compute, and NVVP/nvprof).

3)Why does Active Warps achieved value is 62.23? This is the actual number of active warps right? How is it possible to be 61.23 and not an integer number?

The formula divides one whole number by another, which gives a fractional number if the numerator is not a multiple of the denominator. The reported value is an average over the active cycles, not a count taken at a single instant.
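
A small Python sketch of that arithmetic, using made-up counter totals (the function name is hypothetical and only for this example):

```python
def average_active_warps(active_warps_sm, active_cycles_sm):
    """Average number of active warps per active cycle (integer / integer)."""
    return active_warps_sm / active_cycles_sm


# Hypothetical totals: both counters are whole numbers.
print(average_active_warps(496, 8))  # numerator is a multiple of 8
print(average_active_warps(499, 8))  # numerator is not a multiple of 8
```

The first call gives a whole-number average, while the second gives a fractional one even though both counters are integers.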