Dynamic shared memory reported by ncu is larger than the maximum shared memory per block

I used Nsight Compute to profile an AI model and observed that one of the kernels uses 61.4 KB of shared memory per block, as shown in the attached figure.

I queried the maximum shared memory a thread block can use via the “maxSharedMemLimitPerBlock” attribute, and the result was 49152 bytes (48 KB). I am puzzled as to how the kernel can still use more shared memory than this maximum limit. Could someone explain this?

I came across a figure showing that a GPU of compute capability 8.7 supports a maximum of 163 KB of shared memory per thread block. How can I obtain this value through the CUDA API? Currently I am calling cudaDeviceGetAttribute(&sharedMemLimit, cudaDevAttrMaxSharedMemoryPerBlock, 0);, but it only returns 48 KB.
Furthermore, I have a related question about the third parameter of the <<<>>> launch configuration, which specifies the dynamic shared memory size. If I set this value larger than 48 KB, I get an “invalid argument” error. Given that a thread block can actually use up to 163 KB of shared memory, why is a value larger than 48 KB rejected as invalid?
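For reference, here is a minimal sketch of what I am doing (the kernel and sizes are simplified stand-ins, not my actual model code):

```cuda
// Minimal repro sketch: query the per-block limit, then try to launch
// with more dynamic shared memory than that limit allows by default.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *out) {
    extern __shared__ float smem[];  // dynamic shared memory
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    int sharedMemLimit = 0;
    cudaDeviceGetAttribute(&sharedMemLimit,
                           cudaDevAttrMaxSharedMemoryPerBlock, 0);
    printf("MaxSharedMemoryPerBlock: %d bytes\n", sharedMemLimit);  // 49152 here

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));

    // ~61.4 KB of dynamic shared memory -> launch fails with
    // "invalid argument" even though the hardware supports more.
    size_t smemBytes = 62874;
    kernel<<<1, 256, smemBytes>>>(out);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(out);
    return 0;
}
```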

The last part of this section may help you.

"Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays) and require an explicit opt-in using cudaFuncSetAttribute() as follows."
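A sketch of that opt-in (the kernel name is illustrative): query the architecture-specific limit with the opt-in attribute, raise the kernel's dynamic shared memory ceiling with cudaFuncSetAttribute(), and then the launch succeeds with more than 48 KB.

```cuda
// Opt in to >48 KB of dynamic shared memory per block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    // The architecture-specific maximum (163 KB on compute capability 8.7)
    // is reported by the *opt-in* attribute, not by
    // cudaDevAttrMaxSharedMemoryPerBlock, which always reports 48 KB.
    int optinLimit = 0;
    cudaDeviceGetAttribute(&optinLimit,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    printf("MaxSharedMemoryPerBlockOptin: %d bytes\n", optinLimit);

    // Raise this kernel's dynamic shared memory ceiling above 48 KB.
    size_t smemBytes = 64 * 1024;  // 64 KB
    cudaFuncSetAttribute(kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemBytes);

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));

    // With the opt-in in place, this is no longer "invalid argument".
    kernel<<<1, 256, smemBytes>>>(out);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(out);
    return 0;
}
```

Note that the opt-in is per kernel function, and the requested size must not exceed the value reported by cudaDevAttrMaxSharedMemoryPerBlockOptin for the device.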

Thanks for your response.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.