SM efficiency and warp execution efficiency

I have two implementations:
one with a kernel that computes a sum
one with a kernel that computes a product
In the first I get sm_e: 98% and we_e: 87%
In the second I get sm_e: 95% and we_e: 90%
How can this be explained?


Can you tell us what tool you used? Can you also explain what sm_e and we_e stand for, and how you collect them?

NVIDIA profiler (nvprof),
with the metrics for compute capability 5.x:
sm_efficiency: The percentage of time at least one warp is active on a specific multiprocessor
warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

Thanks for providing these details. Additional details about these metrics can be found in the post "I want to know means about CUPTI metrics in details".

Thank you for the reply.
My question is about the difference between the two kernels.
The first kernel has a better sm_e: 98% versus 95% for the second.
I expected the first kernel to also have a better we_e than the second.
But the first kernel has a worse we_e: 87% versus 90%.

Both metrics (sm_e and we_e) are related to active warps.
Both kernels have a thread divergence problem (no algorithmic optimization has been applied).
So I expected the first kernel, with the worse we_e (i.e., with the bigger thread divergence problem), to have a worse sm_e too.

sm_efficiency is the ratio of cycles in which an SM had at least one active warp to the total number of cycles in the measurement. It tells you how well balanced your workload is. You should try to get this as high as possible.

warp_execution_efficiency is the ratio of the average number of active threads per warp per executed instruction to the maximum number of threads per instruction (warp_size = 32). If this is less than 100%, then either the kernel has thread divergence or it was not launched with a multiple of 32 threads per block.
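To make the ratio concrete, here is a minimal sketch of the arithmetic in Python. It is only a toy model of the definition above, not how nvprof computes the metric internally; the function names and the block size of 48 are made up for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp, as in the definition above

def warp_execution_efficiency(active_threads_per_instruction):
    """Average active threads per executed warp instruction, divided by WARP_SIZE."""
    total_active = sum(active_threads_per_instruction)
    return total_active / (len(active_threads_per_instruction) * WARP_SIZE)

def full_warp_fraction(threads_per_block):
    """Fraction of warp lanes that hold a real thread for a given block size."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return threads_per_block / (warps_per_block * WARP_SIZE)

# Hypothetical trace: 8 instructions issued with all 32 threads active,
# then 2 instructions issued with only 8 threads active (divergent branch).
trace = [32] * 8 + [8] * 2
divergent_eff = warp_execution_efficiency(trace)

# Hypothetical launch with 48 threads per block: the second warp is half empty.
partial_eff = full_warp_fraction(48)

print(f"divergence: {divergent_eff:.1%}, 48 threads per block: {partial_eff:.1%}")
```

Both effects lower the metric in the same way: some lanes of the warp do no useful work while an instruction is issued.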

A high sm_efficiency does not always mean that warp execution efficiency will also be high. For example, when a warp has thread divergence, i.e., different threads in the warp execute different instructions because of conditional branches (if-else), warp execution efficiency will be low, but sm_efficiency can still be high as long as at least one thread in the warp is active for most of the cycles.
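The independence of the two metrics can be illustrated with a small toy model in Python. The cycle and instruction traces below are invented numbers, chosen only to show that one metric counts busy cycles per SM while the other counts active threads per instruction.

```python
WARP_SIZE = 32

def sm_efficiency(active_warps_per_cycle):
    """Fraction of cycles in which the SM had at least one active warp."""
    busy_cycles = sum(1 for warps in active_warps_per_cycle if warps > 0)
    return busy_cycles / len(active_warps_per_cycle)

def warp_execution_efficiency(active_threads_per_instruction):
    """Average active threads per executed warp instruction, divided by WARP_SIZE."""
    total_active = sum(active_threads_per_instruction)
    return total_active / (len(active_threads_per_instruction) * WARP_SIZE)

# Hypothetical SM: at least one warp is active in 98 of 100 cycles.
cycles = [1] * 98 + [0] * 2

# Hypothetical instruction stream: heavily divergent code where only
# 4 of the 32 threads are active on every issued instruction.
instructions = [4] * 50

sm_e = sm_efficiency(cycles)
we_e = warp_execution_efficiency(instructions)
print(f"sm_e = {sm_e:.1%}, we_e = {we_e:.1%}")
```

In this model the SM is busy almost all the time, yet most lanes are idle, so a few percent of difference in one metric says nothing about the direction of the other.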

Thank you for the reply.
My number of threads per block is a multiple of 32.
But I have an if statement with only a true branch, without an else.
In this case I get thread divergence from the if.
And the we_e is not 100%.
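A rough sketch of why an if without an else is enough to lower we_e, written as a simplified model in Python. It assumes one warp of 32 threads, a made-up instruction count before, inside, and after the if, and that the body is issued with only the threads whose condition is true; real hardware and the compiler may behave differently.

```python
WARP_SIZE = 32

def if_without_else_efficiency(condition_per_thread, pre=2, body=4, post=1):
    """Warp execution efficiency of one warp running: pre; if (cond) { body }; post."""
    taken = sum(1 for cond in condition_per_thread if cond)
    trace = [WARP_SIZE] * pre          # all threads active before the if
    if taken > 0:
        trace += [taken] * body        # only threads with a true condition are active
    trace += [WARP_SIZE] * post        # threads reconverge after the if
    return sum(trace) / (len(trace) * WARP_SIZE)

# Hypothetical conditions for the 32 threads of one warp.
half_true = [tid % 2 == 0 for tid in range(WARP_SIZE)]
all_true = [True] * WARP_SIZE
all_false = [False] * WARP_SIZE

half_eff = if_without_else_efficiency(half_true)
all_eff = if_without_else_efficiency(all_true)
none_eff = if_without_else_efficiency(all_false)
print(f"half: {half_eff:.1%}, all: {all_eff:.1%}, none: {none_eff:.1%}")
```

In this model the metric only drops when the condition splits the threads of a single warp. If every thread of a warp takes the same path, the warp stays at 100% even though the if is present.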

I understand that you cannot be sure how many warps are active per cycle.