Trade-off in choosing the GEMM block size in CUTLASS

examples/cute/tutorial/sgemm_nt_1.cu

Hi! I am learning CuTe (CUTLASS), and in this example code I think each block computes a 128*128 tile of C, because:

  // Define block sizes (static)
  auto bM = Int<128>{};
  auto bN = Int<128>{};
  auto bK = Int<  8>{};

The size of C is 5120*5120, and from ncu I know there are 40*40 blocks, so each block should compute a 128*128 tile.
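The block count above can be double-checked with a small host-side sketch. This assumes the kernel is launched with one thread block per bM x bN tile of C, rounding up in each dimension; the helper names `ceil_div` and `gemm_grid` and the `Grid` struct are illustrative only and are not taken from the example source.

```cpp
// Number of tiles of size `tile` needed to cover `extent` elements
// (integer ceiling division).
constexpr int ceil_div(int extent, int tile) {
  return (extent + tile - 1) / tile;
}

// Hypothetical stand-in for a 2D launch grid (x covers M, y covers N).
struct Grid {
  int x;
  int y;
};

// Assumption: one thread block per bM x bN output tile of C.
constexpr Grid gemm_grid(int M, int N, int bM, int bN) {
  return Grid{ceil_div(M, bM), ceil_div(N, bN)};
}
```

With M = N = 5120 and bM = bN = 128, `gemm_grid(5120, 5120, 128, 128)` gives a 40 x 40 grid, i.e. 1600 blocks, which agrees with the ncu count.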

My ncu shows:


So why was this size chosen? I know there should be some balance between compute and memory latency, but how is this trade-off computed? Could someone kindly give me a link to an in-depth analysis? Maybe some papers?
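For context, one rough way to put a number on this trade-off is the arithmetic intensity of a block tile: how many FLOPs the block performs per byte it loads from global memory. The sketch below is a simplified model, not how CUTLASS itself picks tile sizes. It assumes fp32 operands, that each element of the A and B tiles is loaded from global memory exactly once per block, and it ignores the writes to C and any cache reuse. The function name `tile_intensity` is made up for illustration.

```cpp
// Approximate global-memory arithmetic intensity (FLOPs per byte) of one
// bM x bN block tile with fp32 operands.
// Per unit step along K the block:
//   - performs bM * bN multiply-adds, i.e. 2 * bM * bN FLOPs
//   - loads bM floats of A and bN floats of B
// bK cancels out of the ratio, so it does not appear here.
constexpr double tile_intensity(int bM, int bN) {
  return (2.0 * bM * bN) /
         (static_cast<double>(bM + bN) * sizeof(float));
}
```

Under these assumptions a 128 x 128 tile gives 32 FLOPs per byte, while a 64 x 64 tile gives 16. Larger tiles reuse each loaded value more, but they also need more registers and shared memory per block, which lowers occupancy and leaves fewer resident warps to hide latency.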

Thank you!!!

By the way, the register usage is 98, and the shared memory is just 9126 bytes! CuTe is really highly efficient!

Questions about the inner workings of CUTLASS may be better suited for the CUTLASS developers directly. You can ask them on the CUTLASS GitHub page.