Good morning everyone,
I wrote a program to multiply matrices (D = A*B) using mixed-precision tensor cores (wmma API) and shared memory. To do this I followed the samples in the CUDA GitHub repository. I'm working with a Titan RTX GPU, which, if I'm not wrong, has 72 SMs and at most 64 KB of user-manageable shared memory per SM.
The kernel splits D into block tiles of size 128x112, which means 128x112xsizeof(float) = 56 KB. Each block has 8 warps, and each warp computes one row of 16x16 fragments (128/16 = 8 rows, one per warp). So each warp has to compute 7 fragments, since 112/16 = 7. Similarly, A and B are divided into tiles so that they can be loaded into shared memory. The tile for A is 128x112 and the tile for B is 112x112, giving 240x112xsizeof(half) = 53,760 B. Adding the skew overhead of 32 B per row (I shift by 8 shared memory words), the total is 53,760 + 240x32 = 61,440 B = 60 KB.
The point is that I expect the occupancy to be constrained only by shared memory. I expect one block per SM: each block uses at most 60 KB of shared memory, so two blocks cannot be resident on the same SM at the same time. Other resources should not constrain the occupancy, because there are 8 warps per block, that is 256 threads per block, which is far lower than the maximum number of concurrent threads on an SM.
All in all I would expect 72 concurrent blocks (one per SM), but when I run experiments I see that with more than 36 blocks they no longer all run at the same time. I see this by running the program with matrix sizes that produce 1, then 2, then 4, then 8, etc. blocks. Up to 36 blocks the time spent inside the kernel is always the same (I measure it with CUDA events and also with the clock function). Above 36 blocks the time doubles, and it then stays constant up to 72 blocks.
Is there any parameter that I'm missing? Does it make sense to have 36 concurrent blocks with these parameters? It almost seems as if each block were allocated across 2 SMs, but that doesn't make any sense to me. Maybe I'm using too many registers?
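
To make the shared memory arithmetic above easier to check, here is a minimal host-only sketch of it. It is not taken from my attached kernel: the constant and function names are made up for illustration, sizeof(half) is hard-coded to 2 so that no CUDA headers are needed, and I am assuming that the D tile reuses the same shared buffer as the A and B tiles, so the per-block footprint is the larger of the two.

```cpp
#include <cstddef>

// Tile sizes from the description above (names are illustrative only).
constexpr std::size_t kARows     = 128;  // rows of the A tile and of the D tile
constexpr std::size_t kBRows     = 112;  // rows of the B tile
constexpr std::size_t kTileCols  = 112;  // columns of the A, B and D tiles
constexpr std::size_t kHalfBytes = 2;    // sizeof(half), hard-coded (assumption)
constexpr std::size_t kSkewBytes = 32;   // 8 shared memory words of padding per row

// Bytes needed to hold one 128x112 tile of D in float.
constexpr std::size_t dTileBytes() {
    return kARows * kTileCols * sizeof(float);
}

// Bytes for the A and B tiles together, with skew padding added to every row.
constexpr std::size_t abTileBytes() {
    return (kARows + kBRows) * (kTileCols * kHalfBytes + kSkewBytes);
}

// Per-block footprint, assuming the D tile reuses the A/B buffer.
constexpr std::size_t blockSharedBytes() {
    return dTileBytes() > abTileBytes() ? dTileBytes() : abTileBytes();
}

// Resident blocks per SM if shared memory were the only limiting resource.
constexpr std::size_t blocksPerSm(std::size_t smSharedBytes) {
    return smSharedBytes / blockSharedBytes();
}

// Compile-time checks against the numbers quoted in the post.
static_assert(dTileBytes() == 56 * 1024, "D tile should be 56 KB");
static_assert(abTileBytes() == 60 * 1024, "A and B tiles with skew should be 60 KB");
static_assert(blocksPerSm(64 * 1024) == 1, "only one block fits in 64 KB");
```

This only covers the shared memory limit. It says nothing about registers or any other per-SM resource, which is exactly the part I am unsure about.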
shMemMult.cpp (11.9 KB)
I attached my code. I hope it's not too confusing. I changed the extension from .cu to .cpp because otherwise the upload wasn't supported here.
I tried to be as concise as possible, assuming that readers already know how a tiled matrix multiplication is structured, so if the problem is not clear just tell me. I'm new here so be gentle, please :))))