How to choose among so many `mma.m*n*k*` instructions?

I’m reading the PTX manual to learn how to use tensor cores.

I see that there are many instructions with different shapes and data types. How do I choose among them?

For example, I want to compute an fp16 GEMM.

`mma.m16n8k16` and `mma.m8n8k4` can both be used for this. What is the difference between them?

By the way, why are there so many instructions?

Thank you!

Two instructions is hardly many. Actually, there is also `mma.m16n8k8`, so there are three.

For `mma.m8n8k4`, please read the note in the PTX manual:

mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures.

Do not use that one, which leaves two.

Use one of the other two: the larger one (`mma.m16n8k16`) if your problem can accommodate the larger tile and you have at least an Ampere GPU, otherwise the medium one (`mma.m16n8k8`).

Newer GPU generations might optimize the larger shapes better. Currently the two should run at the same speed. Using the larger one also needs fewer instructions and reduces program size, which could help the instruction cache.

Thank you for the clear explanation! I see that some of these APIs are designed for newer architectures.
