I have a kernel compiled for sm_90 that runs on an H100 GPU. Its launch statistics are as follows. It uses 18.82 KB of shared memory. Since the H100 has 228 KB of shared memory per SM, then considering shared memory alone, 12 blocks should be able to reside on an SM (228 / 18.82 ≈ 12.1).
However, the profiler reports that only 5 blocks can reside on an SM. This could be due to the Shared Memory Configuration Size, which is set to 102.4 KB. How is the 102.4 KB value set? If I could raise that limit, I should be able to run with a larger grid size.
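As a rough sanity check on the numbers above, the shared-memory bound on resident blocks can be worked out by hand. The sketch below is plain host-side C++, not the profiler's actual algorithm. It assumes that the profiler's "KB" means 1000 bytes (so 102.4 KB is a 100 KiB carveout and 18.82 KB is 18820 bytes) and that the driver reserves 1 KiB of shared memory per block. The function name `blocksPerSmBySharedMemory` and all constants are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>

// Upper bound on resident blocks per SM imposed by shared memory alone.
// All sizes are in bytes. reservedPerBlockBytes models shared memory that
// the driver sets aside for each block (assumed to be 1 KiB below).
constexpr std::int64_t blocksPerSmBySharedMemory(std::int64_t carveoutBytes,
                                                 std::int64_t sharedPerBlockBytes,
                                                 std::int64_t reservedPerBlockBytes) {
    return (sharedPerBlockBytes + reservedPerBlockBytes) > 0
               ? carveoutBytes / (sharedPerBlockBytes + reservedPerBlockBytes)
               : 0;
}

// Assumption: 18.82 KB from the profiler read as decimal kilobytes.
constexpr std::int64_t kSharedPerBlock = 18820;
// Assumption: 1 KiB per-block driver reservation.
constexpr std::int64_t kReservedPerBlock = 1024;
// 102.4 KB in the profiler, read as a 100 KiB configuration.
constexpr std::int64_t kCarveout100KiB = 100 * 1024;
// The full 228 KiB per SM quoted for the H100.
constexpr std::int64_t kCarveout228KiB = 228 * 1024;

// 100 KiB carveout with the reservation counted: 5 blocks.
static_assert(blocksPerSmBySharedMemory(kCarveout100KiB, kSharedPerBlock,
                                        kReservedPerBlock) == 5,
              "100 KiB carveout");
// 228 KiB with no reservation: the naive estimate of 12 blocks.
static_assert(blocksPerSmBySharedMemory(kCarveout228KiB, kSharedPerBlock, 0) == 12,
              "228 KiB, no reservation");
// 228 KiB with the reservation counted: 11 blocks.
static_assert(blocksPerSmBySharedMemory(kCarveout228KiB, kSharedPerBlock,
                                        kReservedPerBlock) == 11,
              "228 KiB, with reservation");
```

Under these assumptions, the 100 KiB configuration yields 5 blocks, which matches the profiler's figure. A full 228 KiB carveout would yield 11 rather than 12 once the per-block reservation is counted. Other limits (registers, threads per SM, the hardware cap on blocks per SM) may lower the real figure further.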
I have set the preferred carveout value to 100, as mentioned in the reference you provided, but it still doesn’t use the maximum available shared memory. Did you mean some other configuration?
I have the following small example, which limits the occupancy to 4 on the H100 while it could reach 32.