Only two instructions? Actually there is also mma.m16n8k8, so there are three.
For mma.m8n8k4, please read the note:
"mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures."
So do not use that one, which leaves two.
Of those two, use the larger one if your problem can handle the larger tile size and you have at least an Ampere GPU; otherwise use the medium one.
Newer GPU generations might optimize the larger sizes better. Currently both should run at the same speed. The larger one also needs fewer instructions and gives a smaller program, which could be good for the instruction cache.
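
To make the selection rule and the instruction savings concrete, here is a minimal host-side sketch. It assumes f16 operands, where the larger shape would be m16n8k16 (sm_80 and up) and the medium one m16n8k8 (sm_75 and up). The names MmaShape, pick_mma_shape and mma_instruction_count are made up for illustration and are not part of any CUDA API.

```cpp
#include <cassert>

// Tile shape (M x N x K) handled by one warp-level mma instruction.
// Hypothetical helper type, not part of CUDA.
struct MmaShape {
    int m, n, k;
    const char* name;
};

// Pick a shape from the compute capability (75 for sm_75, 80 for sm_80, ...).
// Assumption: f16 operands, so the larger shape is m16n8k16 (needs sm_80+)
// and the medium one is m16n8k8 (needs sm_75+). Older targets are outside
// the scope of this sketch.
inline MmaShape pick_mma_shape(int sm) {
    assert(sm >= 75);
    if (sm >= 80) return {16, 8, 16, "m16n8k16"};
    return {16, 8, 8, "m16n8k8"};
}

// Number of mma instructions a warp issues to cover an M x N x K tile
// with the given shape. The tile must be a multiple of the shape.
inline int mma_instruction_count(MmaShape s, int M, int N, int K) {
    assert(M % s.m == 0 && N % s.n == 0 && K % s.k == 0);
    return (M / s.m) * (N / s.n) * (K / s.k);
}
```

For a 64x64x32 warp tile this gives 64 instructions with m16n8k16 and 128 with m16n8k8, so the larger shape halves the number of mma instructions in the inner loop. That is the program size saving mentioned above; it says nothing about the actual throughput.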