Occupancy is not what I expected

Good morning everyone,
I wrote a program to multiply matrices (D = A*B) using mixed-precision tensor cores (the WMMA API) and shared memory. To do this I followed the samples in the CUDA GitHub repository. I'm working with a Titan RTX GPU, which, if I'm not wrong, has 72 SMs and a maximum of 64 KB of user-manageable shared memory per SM.

The kernel splits D into block tiles of size 128x112, which means 128x112xsizeof(float) = 56 KB. Each block has 8 warps and each warp computes one row of 16x16 fragments, so a warp computes 7 fragments (112/16 = 7). Similarly, A and B are divided into tiles so they can be loaded into shared memory. The tile for A is 128x112 and the tile for B is 112x112, giving 240x112xsizeof(half) = 53,760 B; adding the skew overhead of 32 B per row (I shift each row by 8 shared memory words), the overall size is 53,760 + 240x32 = 60 KB.

The point is that I expect occupancy to be constrained only by shared memory. I expect one block per SM: since each block uses up to 60 KB of shared memory, two blocks cannot be resident on the same SM concurrently. Other resources should not constrain occupancy, because there are 8 warps per block, i.e. 256 threads per block, which is far below the maximum number of concurrent threads per SM. All in all I would expect 72 concurrent blocks (one per SM), but in my experiments I see that with more than 36 blocks they start to overlap. I observe this by running the program with matrix sizes that correspond to 1, then 2, then 4, then 8, etc. blocks. Up to 36 blocks the time spent inside the kernel is always the same (I use CUDA events and also the clock function to check this). Above 36 blocks the time doubles and stays stable up to 72 blocks.

Is there any parameter that I'm missing? Does it make sense to have 36 concurrent blocks with these parameters? It almost seems as if a block is allocated on 2 SMs, but that doesn't make any sense to me. Maybe I'm using too many registers?
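As a sanity check, the runtime itself can report what it thinks limits residency. Below is a stripped-down sketch of that kind of query (the kernel here is just an empty stub with a placeholder name, not my actual kernel; linked against the real kernel, the same calls would report its register count and the number of blocks that can be resident per SM):

```
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Empty stub standing in for the real WMMA kernel (placeholder name).
__global__ void wmmaGemmKernel(const half* A, const half* B, float* D) {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, shared mem per SM: %zu B, opt-in shared mem per block: %zu B\n",
           prop.multiProcessorCount,
           prop.sharedMemPerMultiprocessor,
           prop.sharedMemPerBlockOptin);

    const int    threadsPerBlock = 8 * 32;     // 8 warps = 256 threads
    const size_t dynSharedBytes  = 60 * 1024;  // ~60 KB as computed above

    // Opt in to more than 48 KB of dynamic shared memory per block.
    cudaFuncSetAttribute(wmmaGemmKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)dynSharedBytes);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, wmmaGemmKernel);
    printf("registers per thread: %d, static shared mem: %zu B\n",
           attr.numRegs, attr.sharedSizeBytes);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, wmmaGemmKernel,
                                                  threadsPerBlock, dynSharedBytes);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

On the stub this of course reports a tiny register count, so the numbers are only meaningful when run against the real kernel.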

shMemMult.cpp (11.9 KB)

I attached my code; I hope it's not too confusing. I changed the extension from .cu to .cpp, otherwise it wasn't supported here.

I tried to be as concise as possible, assuming that readers already know how a tiled matrix multiplication is structured, so if the problem is not clear just tell me. I'm new here, so be gentle, please :))))

Use of <= 64 KB of shared memory per SM should be possible. On Turing the two SMs in a TPC arbitrate for access to the shared memory, but there should be no resource allocation constraint that would allow one SM to block resources for another SM.

I would recommend running the application through Nsight Compute. The GPU Speed of Light section and the Occupancy section should be able to identify your issue. An example of why going from 36 to 37-72 blocks might double the duration is that a block on TPC.SM0 is using 100% of shared memory. When you add a thread block on TPC.SM1, it interleaves accesses with SM0, increasing the duration of both thread blocks by 2x.
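If it helps, collecting just those two sections from the command line looks something like: nv-nsight-cu-cli --section SpeedOfLight --section Occupancy ./shMemMult (the executable name is just a placeholder; in newer toolkits the same CLI is named ncu).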

Hi Greg, thanks for the answer.
I’m a bit confused. Are you saying that the two SMs in a TPC compete for their shared memory?
When you say TPC.SM0 uses 100% of shared memory, do you mean the 64 KB reserved just for TPC.SM0? How is it possible that this affects the computation of TPC.SM1, since TPC.SM1 has its own shared memory?

I tried nv-nsight-cu-cli but unfortunately it gives me '==WARNING== No kernels were profiled'. Maybe I have to run the program with sudo, but I don't have the privilege. However, is nv-nsight-cu-cli the right profiler?

100% of the LSU request and write-back bandwidth. If SM0 is at 100% and accesses are arbitrated, then when SM1 also tries to access, the latency will double.

The shared memory capacity is not shared. The physical SRAM is shared.
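Put differently: the 72 SMs are paired two per TPC, giving 36 TPCs. Up to 36 blocks, each block can presumably sit alone on a TPC and have that SRAM's bandwidth to itself; from the 37th block onward two blocks share a TPC and split the arbitrated bandwidth, which would match the roughly 2x jump you see between 36 and 72 blocks.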

This is the correct profiler. Either the admin has to grant profiling permission to users or you have to run with elevated permissions.

Clear. Do you have any reading suggestions for dealing with this issue? I would like to understand how the LSU works in the Turing architecture. I still don't know much about latency.