GEMM tile dimensions for Tensor Cores

Hi,
I have observed that for a 512x64 GEMM operation, NVIDIA doesn't use a single 512x64 GEMM tile and instead uses 4 rounds of a 128x64 tile. The list of tiles used includes a 256x128 tile, which maps to a single SM, but there is no 512x64 tile, even though it has the same number of output elements and so should need the same memory for the output. Why is this done?
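To make the arithmetic concrete, here is a minimal Python sketch of the tile counts and per-tile element counts I am comparing. The helper names and the K-slice depth `TILE_K` are made up for illustration and are not taken from any NVIDIA library. The operand count is only a rough model: one A slice plus one B slice per K step.

```python
from math import ceil

def tiles_needed(m, n, tile_m, tile_n):
    """Number of output tiles needed to cover an m x n result matrix."""
    return ceil(m / tile_m) * ceil(n / tile_n)

def accumulator_elems(tile_m, tile_n):
    """Output (accumulator) elements held by one tile."""
    return tile_m * tile_n

def operand_elems(tile_m, tile_n, tile_k):
    """A-slice plus B-slice elements staged for one K step of one tile."""
    return (tile_m + tile_n) * tile_k

TILE_K = 32  # hypothetical K-slice depth, chosen only for illustration

# A 512x64 output covered by 128x64 tiles takes 4 tiles.
print(tiles_needed(512, 64, 128, 64))  # 4

# 256x128 and 512x64 hold the same number of output elements...
print(accumulator_elems(256, 128), accumulator_elems(512, 64))  # 32768 32768

# ...but under this simple model they stage different amounts of A and B data.
print(operand_elems(256, 128, TILE_K), operand_elems(512, 64, TILE_K))  # 12288 18432
```

So the two shapes match on output elements, but under this model the 512x64 shape stages more operand data per K step. I don't know whether that is the actual reason the shape is missing.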

Thank you.

Is this a question about cuBLAS?