An Efficient Matrix Transpose in CUDA Fortran

Originally published at: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-fortran/

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran. My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a…

Why does padding the 1st dimension of the tile variable mitigate the shared memory bank conflicts? I understand why a 32x32 element tile results in a 32-way shared memory bank conflict, but I don't understand how adding an extra row fixes it.

Because memory is linear. :) CUDA Fortran arrays are column-major, so in a 32x32 tile, elements that differ only in the second index are 32 words apart. When a warp reads across a row of the tile (same first index, second index running 1 through 32, which is exactly what the transposed read in the kernel does), its 32 threads access memory words 0, 32, 64, 96, ... Since bank == word index mod 32, the threads access banks 0, 0, 0, 0, ... They all access the SAME bank, meaning a 32-way bank conflict. But if the tile is declared 33x32 (33 rows by 32 columns, i.e. the first dimension padded by one), the stride between those elements becomes 33, so the threads access words 0, 33, 66, 99, ... == banks 0, 1, 2, 3, .... They all access DIFFERENT banks. So it's an easy fix.
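
For reference, here is a minimal sketch of the kind of kernel being discussed, with the padded tile declaration. The module name, kernel name, and the 1024x1024 matrix size are illustrative assumptions rather than the exact code from the post; TILE_DIM = 32 and BLOCK_ROWS = 8 follow the post's conventions. The only change relative to the conflicting version is the tile(TILE_DIM+1, TILE_DIM) declaration.

```fortran
! Sketch only: names and matrix size are assumptions for illustration.
module transpose_m
  use cudafor
  implicit none
  integer, parameter :: TILE_DIM = 32, BLOCK_ROWS = 8
  integer, parameter :: nx = 1024, ny = 1024   ! assumed matrix size
contains
  attributes(global) subroutine transposeNoBankConflicts(odata, idata)
    real, intent(out) :: odata(ny, nx)
    real, intent(in)  :: idata(nx, ny)
    ! Padding the first dimension to TILE_DIM+1 = 33 means elements that
    ! differ only in the second index are 33 words apart, so a warp
    ! reading across a tile row touches 32 different banks instead of one.
    real, shared :: tile(TILE_DIM+1, TILE_DIM)
    integer :: x, y, j

    x = (blockIdx%x - 1) * TILE_DIM + threadIdx%x
    y = (blockIdx%y - 1) * TILE_DIM + threadIdx%y

    ! Coalesced read from global memory into the shared tile
    do j = 0, TILE_DIM - 1, BLOCK_ROWS
       tile(threadIdx%x, threadIdx%y + j) = idata(x, y + j)
    end do

    call syncthreads()

    x = (blockIdx%y - 1) * TILE_DIM + threadIdx%x
    y = (blockIdx%x - 1) * TILE_DIM + threadIdx%y

    ! Transposed read from the tile: the first index is fixed across the
    ! warp and the second index varies, so without the padding all 32
    ! threads would hit the same bank
    do j = 0, TILE_DIM - 1, BLOCK_ROWS
       odata(x, y + j) = tile(threadIdx%y + j, threadIdx%x)
    end do
  end subroutine transposeNoBankConflicts
end module transpose_m
```

Such a kernel would be launched with dim3(TILE_DIM, BLOCK_ROWS, 1) threads per block and an nx/TILE_DIM by ny/TILE_DIM grid, so during the transposed read each warp sweeps one full 32-element tile row per loop iteration, which is the access pattern the padding is there to protect.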