What is the warp size of 32x32_128x2

Threadblock size is 32*32, warp size is 128*2???
cutlass_80_wmma_tensorop_h161616gemm_32x32_128x2_tt_align2

The warp size is always 32 (with potentially some threads inactive).

Besides the (official) sizes used for invoking a kernel (grid size and block size),
a kernel can also have internal stride sizes and structure sizes.
Each thread can access several elements, and the multidimensional block and grid sizes can be used in different ways for different computation steps of the kernel.
Some warps may not even participate in a computation step, or may exit early.
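
As a minimal sketch (not CUTLASS code, all sizes made up for illustration): the launch configuration need not match the per-step work decomposition inside the kernel. Step 1 below uses a grid-stride loop so each thread handles many elements; step 2 is done by the first warp only, while the other warps skip it.

```cpp
__global__ void twoStepKernel(const float* in, float* out, int n)
{
    __shared__ float partial[128];  // one slot per thread of the 128-thread block

    // Step 1: every thread processes several elements (grid-stride loop),
    // so the problem size is decoupled from gridDim/blockDim.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        sum += in[i];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Step 2: only the first warp participates; the other three warps
    // of the block do nothing here.
    if (threadIdx.x < 32)
    {
        float s = 0.0f;
        for (int j = threadIdx.x; j < blockDim.x; j += 32)
            s += partial[j];
        // warp-level reduction down to lane 0
        for (int offset = 16; offset > 0; offset /= 2)
            s += __shfl_down_sync(0xffffffffu, s, offset);
        if (threadIdx.x == 0)
            out[blockIdx.x] = s;  // one result per block
    }
}

int main()
{
    const int n = 1 << 20, blocks = 32, threads = 128;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    // Launch far fewer threads than elements; step 1 still covers all of n.
    twoStepKernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```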

Oh, yes, I know it is 32.
“cutlass_80_wmma_tensorop_h161616gemm_32x32_128x2_tt_align2” is a CUTLASS kernel name, but I cannot understand what it means…

It is the name of a non-public function. NVIDIA is unlikely to provide a directory and a list of naming conventions for internal non-public functions in their libraries. The naming of internal functions may not even be consistent or 100% “correct”.

Why do you think you need to “decode” the naming of this function?


Because I love NV! Hahahahah!

Just kidding. You know, I am a CUDA developer, and decoding this could help me understand the GPU better.

Thanks~

There can be tile sizes per block involved, then a wmma size per warp for the matrix multiplication; the kernel also handles transposition of the A and/or B matrix, and the data can be aligned to 2 elements (16-bit halves).
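
For reference, here is a minimal sketch of the public WMMA API that the “h161616” part of the name plausibly refers to: half-precision fragments with the 16x16x16 instruction shape. This is not the internal CUTLASS kernel itself, just the warp-level primitive it is presumably built on; the row_major layouts below are a guess at what “tt” might encode, and the “80” presumably means sm_80.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively computes a single 16x16 output tile from one
// 16x16x16 half-precision multiply-accumulate on the tensor cores.
// Launch with (at least) 32 threads; requires compute capability 7.0+.
__global__ void wmma_h161616(const half* A, const half* B, half* C,
                             int lda, int ldb, int ldc)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c;

    wmma::fill_fragment(c, __float2half(0.0f));
    wmma::load_matrix_sync(a, A, lda);   // all 32 lanes cooperate
    wmma::load_matrix_sync(b, B, ldb);
    wmma::mma_sync(c, a, b, c);          // C = A * B + C on tensor cores
    wmma::store_matrix_sync(C, c, ldc, wmma::mem_row_major);
}
```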

Well, my guess is: since one warp is enough to compute 32x32, and a whole block has multiple warps (namely, here there should be 4?), maybe inside they use slice-K, so the whole block together computes the 32x32 tile.
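
To make that guess concrete, here is the arithmetic it would imply, assuming the common CUTLASS reading of the name (MxN threadblock tile, then K-tile x stage count), none of which is officially documented:

```cpp
// Purely speculative decomposition (the real internal layout of this
// kernel is not documented anywhere public):
//
//   threadblock tile : 32 (M) x 32 (N), K-tile 128, 2 pipeline stages
//                      if "32x32_128x2" is read as MxN_Kxstages
//   wmma shape       : 16 x 16 x 16 (from "h161616")
//
// One warp alone could cover the 32x32 output as a 2x2 arrangement of
// 16x16 wmma tiles, stepping 128/16 = 8 times through K. With slice-K
// across 4 warps, each warp would instead take a 128/4 = 32-deep slice
// of K (two wmma steps) for the full 32x32 tile, followed by a
// reduction of the 4 partial results, e.g. through shared memory.
```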