Threadblock size is 32*32, warp size is 128*2???
cutlass_80_wmma_tensorop_h161616gemm_32x32_128x2_tt_align2
The warp size is always 32 (with potentially some threads inactive).
Besides the (official) sizes used to invoke a kernel, the grid size and the block size,
a kernel can also have internal stride sizes and structure sizes.
Each thread can access several elements, and the multidimensional block and grid sizes can be interpreted differently for different computation steps of the kernel.
Some warps may not participate in a given computation step at all, or may exit early.
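For illustration only (a minimal sketch of mine, not code from the kernel in question): the launch shape says nothing about how many elements each thread touches, and threads past the end of the data simply do no work.

```
__global__ void scale(float* x, int n, float s) {
    // Grid-stride loop: one thread may handle many elements, so the
    // grid/block sizes are decoupled from the problem size n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        x[i] *= s;
    }
    // Threads whose first index is already >= n exit immediately.
}

// Usage: the launch configuration is chosen independently of n.
// scale<<<80, 128>>>(d_x, 1 << 20, 2.0f);
```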
Oh, yes. I know the warp size is 32.
“cutlass_80_wmma_tensorop_h161616gemm_32x32_128x2_tt_align2” is a CUTLASS kernel name, but I cannot understand what it means…
It is the name of a non-public function. NVIDIA is unlikely to publish a directory of naming conventions for the internal, non-public functions in its libraries. The naming of internal functions may not even be consistent or 100% “correct”.
Why do you think you need to “decode” the naming of this function?
Because I love NV! Hahahahah!
Just kidding. As you know, I am a CUDA developer, and decoding this could help me understand the GPU better.
Thanks~
There can be tile sizes per block involved, then a WMMA size per warp for the matrix multiplication; the kernel also handles transposition of the A and/or B matrix, and the data can be aligned to 2 elements (16-bit halves).
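As a rough illustration (my own sketch, not CUTLASS source; the kernel name wmma_16x16x16 below is made up): the “h161616” part plausibly denotes a 16×16×16 half-precision MMA shape, which a single warp can execute with the WMMA API:

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 half-precision tile: C = A * B.
// Row-major layouts are chosen arbitrarily here for illustration.
__global__ void wmma_16x16x16(const half* A, const half* B, half* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c;

    wmma::fill_fragment(c, __float2half(0.0f));
    wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);         // warp-wide tensor core op
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}

// Usage: a single warp is enough for one 16x16x16 tile.
// wmma_16x16x16<<<1, 32>>>(d_A, d_B, d_C);
```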
Well, my guess is: one warp is enough to compute 32×32, so why would a whole block have multiple warps (here it should be 4?)? Maybe inside they use sliceK, so the whole block together computes the 32×32 tile.
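Back-of-envelope for this guess (the 32×32 block tile comes from the name; the rest is my assumption, not confirmed CUTLASS internals):

```
// Covering a 32x32 block tile with 16x16 WMMA warp tiles needs
// (32/16) * (32/16) = 4 warps. With sliceK, extra warps would split
// the K dimension instead, and their partial sums would be reduced
// at the end.
constexpr int kBlockM = 32, kBlockN = 32;  // block tile (from the name)
constexpr int kWarpM  = 16, kWarpN  = 16;  // per-warp WMMA tile
constexpr int kWarpsPerTile = (kBlockM / kWarpM) * (kBlockN / kWarpN);  // = 4
```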