About a CUTLASS example with Nsight

While working on batched GEMM (CUTLASS example here) with Nsight, I have observed that for

  int const m = 4096;
  int const n = 4096;
  int const k = 4096;
  int const batch_count = 1;

the number of thread instructions smsp__thread_inst_executed.sum is 86,827,335,680.
However, for

  int const m = 1024;
  int const n = 1024;
  int const k = 1024;
  int const batch_count = 16;

the number of thread instructions smsp__thread_inst_executed.sum is 21,899,509,760.
In both cases, the equivalent of two 4096x4096 matrices is being multiplied, and the count value is 16,777,216, which corresponds to 4096*4096 elements per matrix.

Is there any explanation for the difference? Or am I misunderstanding the matrix sizes?

I don’t think you can generally expect the number of instructions used in the kernel to correlate directly with the number of processed matrix elements, especially not if the kernel implementation is non-trivial. In your case, the kernel implementation is not directly available, since the computation is done by CUTLASS. The library is free to choose different code paths depending on the input sizes, batch count, etc. For more details on how these matrices might be processed differently, please check with the CUTLASS team, e.g. in their forum.
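One quick back-of-the-envelope check worth doing (a sketch, assuming the standard 2·m·n·k floating-point operation count for a dense GEMM; the `gemm_flops` helper below is hypothetical, not part of CUTLASS): the two configurations move the same number of matrix elements, but they do not perform the same amount of arithmetic.

```python
def gemm_flops(m, n, k, batch_count=1):
    # Classic dense GEMM cost: one multiply and one add per inner-product
    # term, i.e. roughly 2*m*n*k FLOPs per batch entry.
    return 2 * m * n * k * batch_count

single = gemm_flops(4096, 4096, 4096, batch_count=1)    # one 4096^3 GEMM
batched = gemm_flops(1024, 1024, 1024, batch_count=16)  # 16 independent 1024^3 GEMMs

print(single)            # 137438953472
print(batched)           # 34359738368
print(single / batched)  # 4.0
```

The 4x arithmetic ratio is close to the reported instruction ratio (86,827,335,680 / 21,899,509,760 ≈ 3.96), so the difference may simply reflect that sixteen 1024x1024x1024 GEMMs are a smaller workload than one 4096x4096x4096 GEMM, even though each operand set holds 16,777,216 elements in total.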
