An Efficient Matrix Transpose in CUDA Fortran

Originally published at: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-fortran/

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran. My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a…

Why does padding the 1st dimension of the tile variable mitigate the shared memory bank conflicts? I understand why a 32x32 element tile results in a 32-way shared memory bank conflict, but I don't understand how adding an extra row fixes it.

Because memory is linear. :) CUDA Fortran arrays are column-major, so in a 32x32 tile, elements that differ only in the second index are 32 words apart. When a warp reads across a row of the tile (same first index, second index running 1 through 32, which is exactly what the transposed read in the kernel does), its 32 threads access memory words 0, 32, 64, 96, ... Since bank == word index mod 32, the threads access banks 0, 0, 0, 0, ... They all access the SAME bank, meaning a 32-way bank conflict. But if the tile is declared 33x32 (33 rows by 32 columns, i.e. the first dimension padded by one), the stride between those elements becomes 33, so the threads access words 0, 33, 66, 99, ... == banks 0, 1, 2, 3, .... They all access DIFFERENT banks. So it's an easy fix.
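
For reference, here is a minimal sketch of the kind of kernel being discussed, with the padded tile declaration. The module name, kernel name, and the 1024x1024 matrix size are illustrative assumptions rather than the exact code from the post; TILE_DIM = 32 and BLOCK_ROWS = 8 follow the post's conventions. The only change relative to the conflicting version is the tile(TILE_DIM+1, TILE_DIM) declaration.

```fortran
! Sketch only: names and matrix size are assumptions for illustration.
module transpose_m
  use cudafor
  implicit none
  integer, parameter :: TILE_DIM = 32, BLOCK_ROWS = 8
  integer, parameter :: nx = 1024, ny = 1024   ! assumed matrix size
contains
  attributes(global) subroutine transposeNoBankConflicts(odata, idata)
    real, intent(out) :: odata(ny, nx)
    real, intent(in)  :: idata(nx, ny)
    ! Padding the first dimension to TILE_DIM+1 = 33 means elements that
    ! differ only in the second index are 33 words apart, so a warp
    ! reading across a tile row touches 32 different banks instead of one.
    real, shared :: tile(TILE_DIM+1, TILE_DIM)
    integer :: x, y, j

    x = (blockIdx%x - 1) * TILE_DIM + threadIdx%x
    y = (blockIdx%y - 1) * TILE_DIM + threadIdx%y

    ! Coalesced read from global memory into the shared tile
    do j = 0, TILE_DIM - 1, BLOCK_ROWS
       tile(threadIdx%x, threadIdx%y + j) = idata(x, y + j)
    end do

    call syncthreads()

    x = (blockIdx%y - 1) * TILE_DIM + threadIdx%x
    y = (blockIdx%x - 1) * TILE_DIM + threadIdx%y

    ! Transposed read from the tile: the first index is fixed across the
    ! warp and the second index varies, so without the padding all 32
    ! threads would hit the same bank
    do j = 0, TILE_DIM - 1, BLOCK_ROWS
       odata(x, y + j) = tile(threadIdx%y + j, threadIdx%x)
    end do
  end subroutine transposeNoBankConflicts
end module transpose_m
```

Such a kernel would be launched with dim3(TILE_DIM, BLOCK_ROWS, 1) threads per block and an nx/TILE_DIM by ny/TILE_DIM grid, so during the transposed read each warp sweeps one full 32-element tile row per loop iteration, which is the access pattern the padding is there to protect.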