Why do some kernels have 0 blocks per SM, especially GEMM-related ones?

Hi,
On an A100-SXM4-80G, I used ncu to profile the llama2-7b large language model. I was surprised to find that, when the 2nd token is generated, the GEMM-related kernels report 0 blocks per SM: as shown in the figure, Block Limit Shared Mem [block] is 0, which in turn makes Waves Per SM 0 as well. What is the reason for this? Is it related to the size of Shared Mem Per Block, for example when more than 163 KB of shared memory is allocated per block?
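To make the arithmetic behind the question concrete, here is a rough sketch of how a shared-memory block limit could be estimated. It is a simplified model, not ncu's actual calculation: the function name `block_limit_shared_mem` is made up for illustration, the 164 KB per-SM capacity and 1 KB per-block system reservation are the values documented for compute capability 8.0, and allocation granularity and the configured carveout are ignored.

```python
# Simplified estimate of the per-SM block limit imposed by shared memory.
# Assumptions (not taken from ncu itself): 164 KB of shared memory per SM
# and 1 KB reserved per block, as documented for compute capability 8.0.
# Allocation granularity and carveout configuration are ignored.

KB = 1024
SMEM_PER_SM = 164 * KB          # assumed maximum shared memory per SM
RESERVED_PER_BLOCK = 1 * KB     # assumed system-reserved shared memory per block


def block_limit_shared_mem(smem_per_block: int,
                           smem_per_sm: int = SMEM_PER_SM,
                           reserved: int = RESERVED_PER_BLOCK) -> int:
    """Return how many blocks fit on one SM under this simplified model."""
    return smem_per_sm // (smem_per_block + reserved)


if __name__ == "__main__":
    for request_kb in (48, 100, 163, 164):
        limit = block_limit_shared_mem(request_kb * KB)
        print(f"{request_kb} KB per block: limit {limit}")
```

Under this model a limit of 0 only appears when the per-block request plus the reservation exceeds what one SM can hold, and such a kernel should not be launchable in the first place. A request of exactly 163 KB still gives a limit of 1.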


Thanks for reporting this. At first glance, it looks like a bug or something isn’t being calculated properly by the tool, but we will do some more investigation and let you know what we determine.

Thank you very much for your reply, I understand : )