I’m currently studying CUDA programming and am confused about the relationship between warps and thread blocks per SM.
On compute capability 9.0,
the maximum number of resident blocks per SM is 32
and the maximum number of threads per block is 1024.
So I thought the maximum number of resident warps per SM should be 32 * (1024 / 32) = 1024, where the divisor 32 is the warp size.
But it is actually 64.
Can you explain why, and how that number is calculated?
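
To make the mismatch concrete, here is a minimal sketch of the arithmetic in plain host-side C++. It relies on one number not quoted above: a limit of 2048 resident threads per SM, which is my reading of the compute capability 9.0 row in the CUDA Programming Guide and should be treated as an assumption. The struct and function names are made up for illustration, and the sketch ignores register and shared-memory limits, which can lower residency further.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits, passed in by hand (nothing here is queried from a device).
struct SmLimits {
    int max_blocks_per_sm;      // 32 on compute capability 9.0
    int max_threads_per_block;  // 1024
    int max_threads_per_sm;     // 2048 (assumed, see the note above)
    int warp_size;              // 32
};

// The product from the question: assumes 32 resident blocks can each
// hold 1024 threads at the same time.
int naive_warps_per_sm(const SmLimits& l) {
    assert(l.warp_size > 0);
    return l.max_blocks_per_sm * (l.max_threads_per_block / l.warp_size);
}

// Warp limit implied by the per-SM thread limit alone.
int warps_from_thread_limit(const SmLimits& l) {
    assert(l.warp_size > 0);
    return l.max_threads_per_sm / l.warp_size;
}

// Resident blocks for a given block size, capped by both the block limit
// and the warp limit. Block sizes are rounded up to whole warps.
int resident_blocks(const SmLimits& l, int threads_per_block) {
    assert(threads_per_block > 0 &&
           threads_per_block <= l.max_threads_per_block);
    int warps_per_block = (threads_per_block + l.warp_size - 1) / l.warp_size;
    return std::min(l.max_blocks_per_sm,
                    warps_from_thread_limit(l) / warps_per_block);
}
```

With `SmLimits{32, 1024, 2048, 32}`, `naive_warps_per_sm` gives 1024 and `warps_from_thread_limit` gives 64. Under the same assumption, `resident_blocks` returns 2 for 1024-thread blocks and 32 for 64-thread blocks, so the two per-SM maxima would not be reachable with the same block size.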