How to evaluate whether a kernel fully utilizes the GPU?

Can we check whether a kernel fully utilizes the GPU by testing whether the number of blocks it launches is greater than (number of SMs) x (maximum thread blocks per SM)?

You would have to define what you mean by “fully utilize”.

If you are referring to occupancy, then your test is sufficient, but in some cases it probably asks for far more blocks than are needed.

The hardware limit for the maximum number of blocks per SM is something like 16 (it varies by GPU type). In many cases, it does not take 16 blocks per SM to reach full occupancy. With many launch configurations (threadblock sizes), only 2-3 may be needed. For example, with 1024 threads per block on an SM that can hold 2048 threads, 2 blocks already fill the SM.

But if you are looking for a simple metric, that would do it.

I think a “closer” calculation would be to make sure that each SM can be filled to its maximum thread-carrying capacity. This number is reported by deviceQuery, or you can look it up in the documentation. Typical numbers here are 1024, 1536, or 2048. So if you make sure that:

number of blocks >= ((max threads per SM)/(threads per block)) x (number of SMs) 
                             ^                      ^                ^
       From:          deviceQuery      your kernel launch   deviceQuery

then I think you may in some cases get a closer number. It is still not a full/proper occupancy calculation because it does not take into account other limits, such as registers per SM, shared memory per SM, etc.
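
As a rough illustration of the inequality above, here is a minimal host-side sketch. The function and parameter names are made up for this example, and the extra cap on blocks per SM is my addition (it reflects the hardware block limit mentioned earlier, not something in the formula itself). Like the formula, it ignores register and shared memory limits.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: smallest grid size that lets every SM be filled to its
// thread-carrying capacity. Inputs come from deviceQuery (or
// cudaGetDeviceProperties) and from your own kernel launch.
int min_blocks_to_fill_threads(int max_threads_per_sm,  // from deviceQuery
                               int threads_per_block,   // your kernel launch
                               int num_sms,             // from deviceQuery
                               int max_blocks_per_sm) { // hardware block limit
    assert(max_threads_per_sm > 0 && threads_per_block > 0);
    assert(num_sms > 0 && max_blocks_per_sm > 0);

    // Integer division rounds down: a partial block cannot be resident.
    int blocks_per_sm = max_threads_per_sm / threads_per_block;

    // Assumption added here: an SM never holds more blocks than its hardware limit.
    blocks_per_sm = std::min(blocks_per_sm, max_blocks_per_sm);

    return blocks_per_sm * num_sms;
}
```

With 2048 threads per SM, 256 threads per block, 80 SMs, and a limit of 32 blocks per SM, this gives 8 blocks per SM and 640 blocks in total.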

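You can use the Occupancy API (for example, cudaOccupancyMaxActiveBlocksPerMultiprocessor) to figure out how many blocks are needed. Unlike the simple formulas above, it accounts for the kernel's register and shared memory usage.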
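
A minimal sketch of what an Occupancy API query might look like. The kernel `my_kernel`, the block size of 256, and the use of device 0 are placeholders for illustration only; substitute your own kernel and launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute your own.
__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int threads_per_block = 256;   // assumed launch configuration
    const size_t dynamic_smem_bytes = 0; // dynamic shared memory per block

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    // Resident blocks per SM for this kernel, taking registers and
    // shared memory into account.
    int blocks_per_sm = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, threads_per_block, dynamic_smem_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Grid size at which every SM holds as many blocks as it can.
    int blocks_to_fill_gpu = blocks_per_sm * prop.multiProcessorCount;

    // Fraction of each SM's thread capacity that those blocks occupy.
    double occupancy = (double)(blocks_per_sm * threads_per_block) /
                       prop.maxThreadsPerMultiProcessor;

    printf("blocks per SM: %d\n", blocks_per_sm);
    printf("blocks to fill GPU: %d\n", blocks_to_fill_gpu);
    printf("theoretical occupancy: %.2f\n", occupancy);
    return 0;
}
```

If your launch has at least `blocks_to_fill_gpu` blocks, every SM can hold its maximum number of resident blocks for this kernel. That is a statement about theoretical occupancy only, not about achieved performance.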

Please don’t post pictures of code on this forum