An Efficient Matrix Transpose in CUDA C/C++

Originally published at: https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/

My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses.…

I have been trying to understand the thread and block indexing pattern in the simple matrix copy example. Why do we use TILE_DIM as the stride when calculating y, given that the block size is (TILE_DIM x BLOCK_ROWS)? We are amortizing the index calculation by having each thread do TILE_DIM / BLOCK_ROWS copies. I tried a thread block of (4,1) and a grid of (2,2) blocks with a square matrix of width 8, and I find that the computed offsets go beyond 15, which I thought was above the matrix's linear (1D) dimension. Kindly help with an example. I would also like to see a tutorial explaining matrix tiling with amortization in detail.

__global__ void copy(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
}

A square matrix of size 8x8 has a linear size of 64, which is much greater than 15. Even though each block has fewer rows than TILE_DIM, each block is responsible for a whole tile. The key here is that the tiles (and the whole matrix) are square, while the thread blocks are not.

Many thanks, Mark, for clarifying. If I am not wrong, 4 thread blocks will run through the kernel to cover the entire 8x8 square matrix, and each thread will copy 4 elements of its tile, one row at a time.

I'm running this example on a Tesla K20m.
My output is:
Device : Tesla K20m
Matrix size: 1024 1024, Block size: 32 8, Tile size: 32 32
dimGrid: 32 32 1. dimBlock: 32 8 1
Routine Bandwidth (GB/s)
copy 126.53
shared memory copy 83.17
naive transpose 53.42
coalesced transpose 62.11
conflict-free transpose 62.27
As you can see, the bandwidth is much smaller than on a Tesla K20c, and the kernels that use shared memory are especially slow. Could you give me a clue what is happening?
My machine: 2x Tesla K20m, 1x Quadro 410, Linux

There are many things that could be wrong, but basically this indicates to me that your K20m is not running at full speed. The K20m is a passively cooled K20c (meaning it has no fan; the server fans must cool it). Are you sure it is sufficiently cooled? Is it getting sufficient power? Is it power capping? What version of the driver/toolkit are you using? Is it a shared/multi-user node? Is persistence mode enabled? Have you checked the output of nvidia-smi?

Some nvidia-smi outputs are:
Driver version: 319.37
Performance state: P0
Power drawn: 117.83 W (during computations)
Power limit: 225 W
PCIe generation: current 2; max 2
All clocks at max. My toolkit version is 5.5. Yes, this is a multi-user node, but during this computation only one user was logged in (me). The operating system is Scientific Linux 64-bit, runlevel 3. I have enabled persistence mode; nothing changed. My power supply is 1500 W (2x750 W). Operating temperature is ~30 deg C, ~50 during long computations. I've found that the newest driver for the Tesla is 319.82, but the Quadro 410 is not supported by it; the newest driver for the Quadro 410 is 331.38, which does not support the Tesla. Should I reinstall the driver and get rid of the Quadro card?

Have you tried running other tests to see if you get expected performance? I recommend you log in as a registered developer and file a bug. (register if you are not already here: https://developer.nvidia.co... ) Much easier than trying to debug in these comments. :) Alternatively, try the forums at http://devtalk.nvidia.com

Nice article, Mark, thank you.

I'm trying to figure out how to change the indices to adapt this kernel

__global__ void transposeNoBankConflicts(float *odata, const float *idata)

to a non-square MxN matrix of arbitrary size. Could you give me some tips, please?

Regards!

I did some benchmarking on my 780 Ti for big matrices and different block configurations, transposing an 11431 x 4651 matrix. I found that 32x32 blocks perform the fastest. It also seems that nvcc does not perform any loop unrolling, because by unrolling manually I gained a notable performance boost. Here are my results:
-32x32, naive: 4.75 ms
-16x16, naive: 3.71 ms
-32x8: 3.64 ms
-32x8, unrolled: 3.27 ms
-32x16: 2.90 ms
-32x16, unrolled: 2.69 ms
-32x32, "unrolled": 2.62 ms

I had to extend the above code a little to work with non-square matrices. If anybody feels too lazy, here is my final version for 32x32 blocks (compute capability 2.0 and higher):

__global__ void transpose_sharedMem(float *odata, const float *idata,
                                    const unsigned int width, const unsigned int height)
{
  __shared__ float tile[32][33];     // extra column avoids shared memory bank conflicts

  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;

  if (x < width && y < height)
    tile[threadIdx.y][threadIdx.x] = idata[y*width + x];
  __syncthreads();

  x = blockIdx.y * 32 + threadIdx.x; // transposed block offset
  y = blockIdx.x * 32 + threadIdx.y;
  if (y < width && x < height)
    odata[y*height + x] = tile[threadIdx.x][threadIdx.y];
}

I thought it could not work for non-square matrices, but maybe I was wrong.

I am getting the following output on my machine, which is not as expected. Can you please tell me what difference in my machine's architecture explains this?

Device : Tesla K20Xm
Matrix size: 1024 1024, Block size: 32 8, Tile size: 32 32
dimGrid: 32 32 1. dimBlock: 32 8 1
Routine Bandwidth (GB/s)
copy 145.23
shared memory copy 92.51
naive transpose 61.84
coalesced transpose 69.81
conflict-free transpose 69.34

Why is the shared memory copy giving you more bandwidth? Without shared memory, all loads by the threads in a warp will be coalesced, and then all stores will be coalesced; in fact, I think a register will be used to load and store each element. Therefore, the copy without shared memory should perform well.

I noticed similar results, so I profiled the application and noted that there is considerable variation in the duration of the kernels. So I increased the number of runs from 100 to about 10000, and now my results are closer to what is expected:

Device : GM20B
Matrix size: 1024 1024, Block size: 32 8, Tile size: 32 32
dimGrid: 32 32 1. dimBlock: 32 8 1
Routine Bandwidth (GB/s)
copy 18.54
shared memory copy 18.32
naive transpose 7.47
coalesced transpose 14.26
conflict-free transpose 16.61

I just found an NVIDIA document from 2009 suggesting that diagonal block reordering is indispensable due to partition camping (http://www.cs.colostate.edu....
Your benchmarks seem to disconfirm that. Is it indeed no longer a concern with today's (post-1.x) cards?

Correct, partition camping is not really an issue on today's GPUs. Even at the time of this post it was not a big issue, so we omitted the block reordering guidance from the post to simplify the task.

Just out of curiosity: why is it no longer an issue on today's GPUs, and are there exceptional cases where partition camping does become relevant?

We built more sophisticated memory address translation hardware to avoid camping in the common cases. The incidence of partition camping is greatly reduced, so it is rarely an issue. There are, of course, exceptional cases. One example is when many thread blocks all access memory in a coalesced fashion, but they all access the same small amount of data (maybe 128 bytes) simultaneously, so they all map to the same partition.

In my personal opinion, CUDA is more user-friendly and easier to integrate than FPGAs.