Why are ROW and COL inverted in cudaTensorCoreGemm.cu?

Hi,

I am building a visual map of the cudaTensorCoreGemm.cu example to better understand it.

From the names and values of the macros:
WARP_ROW_TILES,
WARP_COL_TILES,
BLOCK_ROW_TILES,
BLOCK_COL_TILES,

I would expect a layout like the one in this first image:

However, later in the code, WARP_COL_TILES and WARP_ROW_TILES are used in reverse. The MMA tiles are processed as shown in the next image:

Is there a specific performance reason behind this?

Am I missing something?

Link to the example:
cudaTensorCoreGemm.cu

Thank you.

I am not familiar with the code you are looking at and will not take the time to read through it.

Generally speaking, with GEMM, each of the matrices A and B can be used as-is (“N”, not transposed) or transposed (“T”). Importantly, any transposition does not need to be performed explicitly; it is typically done implicitly via the indexing. With a standard GEMM, the fastest combination is usually “NT” (that is, A not transposed, B transposed), because this leads to contiguous linear memory access (as opposed to strided access) when loading tiles of either matrix.

You might want to inspect the code to see whether maximizing the efficiency of loads from memory is served by whatever arrangement you are observing.
