Is the documentation on MMA from the NVIDIA GTC 2020 talk incorrect?

I’m watching the NVIDIA GTC 2020 talk “Developing CUDA kernels to push Tensor Cores to the Absolute Limit on NVIDIA A100”, and I do not understand this example:

Presumably, there should be 4 values of C, right?

I’m also wondering whether the data has to be organized in exactly this format. That is, here it looks like the threads are tiled such that thread 0 reads from rows 0 and 8. But could it be organized differently, e.g. with thread 0 reading from rows 0 and 1?

I think so, but TBH I’m not sure without trying it. I occasionally run across things I can’t explain regarding TC arguments.

The mapping of registers to matrix elements (and of which thread provides each of them) is fixed for the mma instructions, so you cannot choose a different layout. The fixed mapping is covered in the PTX guide. Example.
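To make the fixed layout concrete, here is a minimal host-side C++ sketch of the accumulator (C/D) fragment layout as I read it in the PTX guide for the 16x8 f32 accumulator tile of `mma.m16n8k16`. The thread does not say which mma shape the slide uses, so that shape is an assumption, and the names `accumulator_coord` and `covers_tile_exactly_once` are made up for illustration. It only computes indices; it does not issue any mma instruction.

```cpp
#include <cassert>
#include <cstdio>

// Position of one accumulator element inside the 16x8 C/D tile.
struct Coord {
    int row;
    int col;
};

// Assumed layout (PTX guide, m16n8k16 with f32 accumulators):
//   groupID           = lane / 4   (0..7)
//   threadID_in_group = lane % 4   (0..3)
//   row = groupID      for c0, c1
//   row = groupID + 8  for c2, c3
//   col = threadID_in_group * 2 + (i & 1)
Coord accumulator_coord(int lane, int i) {
    assert(lane >= 0 && lane < 32);  // one warp
    assert(i >= 0 && i < 4);         // four accumulator registers per thread
    const int group_id = lane >> 2;
    const int thread_in_group = lane % 4;
    const int row = (i < 2) ? group_id : group_id + 8;
    const int col = thread_in_group * 2 + (i & 1);
    return Coord{row, col};
}

// Checks that the 32 lanes x 4 registers own each of the 16x8 elements once.
bool covers_tile_exactly_once() {
    int owners[16][8] = {};
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < 4; ++i) {
            const Coord c = accumulator_coord(lane, i);
            ++owners[c.row][c.col];
        }
    }
    for (int r = 0; r < 16; ++r) {
        for (int c = 0; c < 8; ++c) {
            if (owners[r][c] != 1) return false;
        }
    }
    return true;
}

// Prints which (row, col) elements a given lane holds in c0..c3.
void print_lane(int lane) {
    for (int i = 0; i < 4; ++i) {
        const Coord c = accumulator_coord(lane, i);
        std::printf("lane %d, c%d -> (row %d, col %d)\n", lane, i, c.row, c.col);
    }
}
```

Under this layout each thread holds 4 values of C, and lane 0 owns elements in rows 0 and 8 (columns 0 and 1), which matches the row 0 / row 8 pattern described in the question. A layout where thread 0 holds rows 0 and 1 is not something the instruction lets you pick.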
