examples/cute/tutorial/sgemm_nt_1.cu
Hi! I am learning CuTe (CUTLASS), and in this example code I think each thread block computes a 128*128 tile of C, because:
// Define block sizes (static)
auto bM = Int<128>{};
auto bN = Int<128>{};
auto bK = Int< 8>{};
And the size of C is 5120*5120. Using ncu, I can see there are 40*40 blocks, so each block should compute 128*128.
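To double-check that arithmetic, here is a minimal sketch in plain C++. It is not taken from the tutorial: `ceil_div_sketch`, `grid_x`, `grid_y`, `total_blocks`, and the constants are hypothetical names of my own, and I am assuming the grid is sized as ceil(M/bM) by ceil(N/bN) with one thread block per tile of C.

```cpp
#include <cassert>

// Hypothetical helper (not the CuTe API): integer ceiling division.
constexpr int ceil_div_sketch(int a, int b) { return (a + b - 1) / b; }

// Sizes from the post: C is 5120 x 5120, and bM = bN = 128.
constexpr int kM = 5120;
constexpr int kN = 5120;
constexpr int kBlockM = 128;
constexpr int kBlockN = 128;

// Assumption: one thread block per bM x bN tile of C.
constexpr int grid_x() { return ceil_div_sketch(kM, kBlockM); }
constexpr int grid_y() { return ceil_div_sketch(kN, kBlockN); }
constexpr int total_blocks() { return grid_x() * grid_y(); }

// Compile-time checks: 5120 / 128 = 40 tiles per dimension, 1600 in total.
static_assert(grid_x() == 40 && grid_y() == 40, "expected a 40 x 40 grid");
static_assert(total_blocks() == 1600, "expected 1600 thread blocks");
```

Under that assumption the grid comes out as 40*40 = 1600 blocks, which agrees with what ncu reports.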
My ncu shows:
So why was this size chosen? I know there should be some balance between compute and latency, but how is this trade-off computed? Could someone kindly give me some links to an in-depth analysis? Maybe some papers~
Thank you!!!
