4x4 WMMA on Tensor Cores

Hello, I read that each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C.

I have a use case where this seems perfect, but I need to multiply thousands of 4x4 matrices and accumulate the results.

Yet I see that CUDA seems to support only 16x16 matrix fragments. Can I use Tensor Cores for 4x4 matrices in some way other than simply adding padding?

Hi @jakub.mitura14
This forum branch is dedicated to cuda-gdb tool support. Your question might be better suited for CUDA Programming and Performance - NVIDIA Developer Forums. I have moved your topic there.


The smallest matrix fragment shapes that CUDA C++ supports directly through intrinsics are 16x16, 8x32, and 32x8. Even at the PTX level, no 4x4 operations are exposed.
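If "accumulate" here means summing the products, D = sum_i A_i * B_i, one possible alternative to zero-padding every 4x4 matrix separately is to pack four pairs side by side along the K dimension of a single 16x16 tile. The sketch below only checks that arithmetic on the host: the plain loops stand in for the tensor-core multiply-accumulate, and the names `Mat4`, `Tile`, `sum_of_products`, and `sum_of_products_tiled` are hypothetical, not part of any CUDA API. It assumes row-major storage and equally sized input lists. Note that this still leaves 12 of the 16 rows and columns of each tile as zeros, so it reduces the padding rather than removing it.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Mat4 = std::array<float, 16>;     // one 4x4 matrix, row-major
constexpr int T = 16;                   // tile edge, matching a 16x16x16 fragment shape
using Tile = std::array<float, T * T>;  // one 16x16 tile, row-major

// Reference: D = sum_i A[i] * B[i], one 4x4 product at a time.
// Assumes A.size() == B.size().
Mat4 sum_of_products(const std::vector<Mat4>& A, const std::vector<Mat4>& B) {
    Mat4 D{};  // zero-initialised accumulator
    for (std::size_t i = 0; i < A.size(); ++i)
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                for (int k = 0; k < 4; ++k)
                    D[r * 4 + c] += A[i][r * 4 + k] * B[i][k * 4 + c];
    return D;
}

// Same sum, but four pairs are packed into one 16x16 tile pair per step.
// A_j goes into rows 0..3, columns 4j..4j+3 of tile a.
// B_j goes into rows 4j..4j+3, columns 0..3 of tile b.
// The top-left 4x4 block of a * b is then A_0*B_0 + A_1*B_1 + A_2*B_2 + A_3*B_3.
Mat4 sum_of_products_tiled(const std::vector<Mat4>& A, const std::vector<Mat4>& B) {
    Tile acc{};  // plays the role of the accumulator fragment, kept across steps
    for (std::size_t base = 0; base < A.size(); base += 4) {
        Tile a{}, b{};  // unused entries stay zero (this is the remaining padding)
        for (std::size_t j = 0; j < 4 && base + j < A.size(); ++j)
            for (int r = 0; r < 4; ++r)
                for (int c = 0; c < 4; ++c) {
                    a[r * T + (4 * j + c)] = A[base + j][r * 4 + c];
                    b[(4 * j + r) * T + c] = B[base + j][r * 4 + c];
                }
        // Host stand-in for one 16x16x16 multiply-accumulate: acc += a * b.
        for (int r = 0; r < T; ++r)
            for (int c = 0; c < T; ++c)
                for (int k = 0; k < T; ++k)
                    acc[r * T + c] += a[r * T + k] * b[k * T + c];
    }
    Mat4 D{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            D[r * 4 + c] = acc[r * T + c];  // only the top-left 4x4 block is used
    return D;
}
```

On the GPU, the tiles `a` and `b` would be the buffers loaded into 16x16x16 fragments, and the inner triple loop would be replaced by the warp-level multiply-accumulate. Whether this is faster than ordinary CUDA-core code for 4x4 matrices would need to be measured.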
