GEMM tile dimensions for tensor cores

I have observed that for a 512x64 GEMM operation, NVIDIA doesn’t use a single 512x64 GEMM tile and instead uses four rounds of a 128x64 tile. The list of tiles used includes a 256x128 tile that maps to a single SM, but there is no 512x64 tile, even though it would require the same memory. Why is this done?
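To make the arithmetic in the question concrete, here is a minimal sketch that counts how many 128x64 tiles cover a 512x64 output and compares the element counts of a 512x64 and a 256x128 tile. The function names and the K depth of 32 are assumptions chosen for illustration, not values taken from any NVIDIA library, and the sketch does not claim to show how tiles are actually selected.

```python
import math

def tiles_needed(m, n, tile_m, tile_n):
    """Number of tile_m x tile_n tiles needed to cover an m x n output."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)

def output_elements(tile_m, tile_n):
    """Elements in the output (accumulator) tile."""
    return tile_m * tile_n

def operand_elements(tile_m, tile_n, tile_k):
    """Elements in one A slice (tile_m x tile_k) plus one B slice (tile_k x tile_n)."""
    return (tile_m + tile_n) * tile_k

TILE_K = 32  # hypothetical K depth per step, an assumption for this sketch

# Four 128x64 tiles cover the 512x64 output.
print(tiles_needed(512, 64, 128, 64))

# Both shapes hold the same number of output elements.
print(output_elements(512, 64), output_elements(256, 128))

# The operand slices differ: (512 + 64) * K versus (256 + 128) * K.
print(operand_elements(512, 64, TILE_K), operand_elements(256, 128, TILE_K))
```

Under these assumptions the two shapes match in output elements, while the 512x64 shape needs larger A and B slices per step. Whether that difference is why no 512x64 tile is offered is not established here.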

Thank you.

Hi @sr3007,
This doesn’t seem to be related to cuDNN. Please raise a new topic under the category below:

In case I have missed a cuDNN-specific query, please elaborate on it so we can help better.