rectangular matrix transpose


I’m trying to use the matrix transpose code sample from the CUDA SDK.
It works well for square matrices and for matrices whose dimensions are powers of 2.
I now have 2 problems:

  • For matrices smaller than 16x16, it doesn’t work because no threads are created. I worked around that by using the naive version for matrices smaller than 16x16 and by calling it with:
    dim3 grid((size_x+15) / 16, (size_y+15) / 16, 1);

  • For fully arbitrary matrices (i.e. non-power-of-2, non-square), I get strange results (I still call it with the same dim3 grid as above): several zeros appear in my matrix where there should be none.

Is there a solution?

Thank you very much in advance,


You have to pad your dimensions to multiples of 16. For example, a 100x200 matrix should be embedded in a 112x208 matrix, transposed, and then extracted.
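The padded sizes come from rounding each dimension up to the next multiple of the block edge. A minimal sketch of that arithmetic in host-side C++, where round_up and blocks_for are hypothetical helper names and not part of the SDK sample:

```cpp
#include <cassert>

// Hypothetical helper (not part of the SDK sample): round n up to the
// next multiple of tile. A value that is already a multiple is unchanged.
unsigned int round_up(unsigned int n, unsigned int tile)
{
    assert(tile > 0);
    return ((n + tile - 1) / tile) * tile;
}

// Number of tile-sized blocks needed to cover n elements.
unsigned int blocks_for(unsigned int n, unsigned int tile)
{
    return round_up(n, tile) / tile;
}
```

With a tile of 16, round_up(100, 16) gives 112 and round_up(200, 16) gives 208, so the grid is 7 by 13 blocks.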

The code should look something like this (Note: I haven’t checked for errors!):

float *inarr, *outarr;
CUDA_SAFE_CALL(cudaMalloc((void**) &inarr, 100*200*4));
CUDA_SAFE_CALL(cudaMalloc((void**) &outarr, 200*100*4));

… Load input array …

float *inarr_pad, *outarr_pad;
CUDA_SAFE_CALL(cudaMalloc((void**) &inarr_pad, 112*208*4));
CUDA_SAFE_CALL(cudaMalloc((void**) &outarr_pad, 208*112*4));

// Clear padded arrays
cudaMemset(inarr_pad, 0, 112*208*4);
cudaMemset(outarr_pad, 0, 112*208*4);

// Load padded input array
for (int i=0; i<200; i++)
    CUDA_SAFE_CALL(cudaMemcpy(&inarr_pad[i*112], &inarr[i*100], 100*4, cudaMemcpyDeviceToDevice));

// Perform transpose
dim3 grid_tran(7, 13, 1);
dim3 threads_tran(16, 16, 1);
transpose<<< grid_tran, threads_tran >>>(outarr_pad, inarr_pad, 112, 208);

// Extract transposed array
for (int i=0; i<100; i++)
    CUDA_SAFE_CALL(cudaMemcpy(&outarr[i*200], &outarr_pad[i*208], 200*4, cudaMemcpyDeviceToDevice));
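The row strides in the load and extract loops are easy to get wrong, so it can help to check the same pad, transpose, extract sequence on the CPU first. This is only a sketch: transpose_ref stands in for the kernel, and both function names are made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for the kernel: `in` has h rows of w elements,
// `out` receives w rows of h elements.
void transpose_ref(std::vector<float>& out, const std::vector<float>& in, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[x * h + y] = in[y * w + x];
}

// Transpose a matrix of `rows` rows and `cols` columns through
// zero-padded buffers whose sides are multiples of `tile`.
std::vector<float> padded_transpose(const std::vector<float>& in, int cols, int rows, int tile)
{
    assert(tile > 0 && in.size() == std::size_t(cols) * std::size_t(rows));
    const int pw = ((cols + tile - 1) / tile) * tile;  // padded width
    const int ph = ((rows + tile - 1) / tile) * tile;  // padded height

    std::vector<float> in_pad(std::size_t(pw) * ph, 0.0f);
    std::vector<float> out_pad(std::size_t(pw) * ph, 0.0f);

    // Load: input row i (cols elements) goes to padded row i (stride pw).
    for (int i = 0; i < rows; ++i)
        std::copy(in.begin() + i * cols, in.begin() + (i + 1) * cols,
                  in_pad.begin() + i * pw);

    transpose_ref(out_pad, in_pad, pw, ph);

    // Extract: the result has cols rows of rows elements; padded stride is ph.
    std::vector<float> out(std::size_t(cols) * rows);
    for (int i = 0; i < cols; ++i)
        std::copy(out_pad.begin() + i * ph, out_pad.begin() + i * ph + rows,
                  out.begin() + i * rows);
    return out;
}
```

For the 100x200 case this uses the same strides as the device code: 112 on the way in and 208 on the way out.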


Thank you very much for your answer, but is there a way to avoid padding? I am implementing a general Matrix class for linear algebra with as few readbacks as possible. Padding my matrices only for the transpose function would cause lots of problems for the other functions, and I don’t want to get the CPU involved in this function…

Thanks !

I know that my reply is late, but I had the same problem and I found another solution (I’m posting it in case it helps someone in the future):

// h_data ==> matrix [realSizeX][realSizeY]
uint memSize = sizeof(float) * realSizeX * realSizeY;
float *d_data, *d_out;

// Round the grid (not the data) up to a whole number of blocks
uint size_x = ((realSizeX + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM;
uint size_y = ((realSizeY + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM;

deviceMalloc((void**) &d_data, memSize);
deviceMalloc((void**) &d_out, memSize);
hostToDevice(d_data, h_data, memSize);

dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

// Transpose function in the SDK. The output goes to a separate buffer:
// one block may write where another block has not read yet.
transpose<<< grid, threads >>>(d_out, d_data, realSizeX, realSizeY);

deviceToHost(h_data, d_out, memSize);
deviceFree(d_data);
deviceFree(d_out);


// Functions : 

// CUDA_SAFE_CALL aborts on error, so each wrapper returns 0 on success.
int hostToDevice(void *to, void *from, int size){CUDA_SAFE_CALL(cudaMemcpy(to, from, size, cudaMemcpyHostToDevice)); return 0;}

int deviceToHost(void *to, void *from, int size){CUDA_SAFE_CALL(cudaMemcpy(to, from, size, cudaMemcpyDeviceToHost)); return 0;}

int deviceFree(void *ptr){CUDA_SAFE_CALL(cudaFree(ptr)); return 0;}

int deviceMalloc(void** ptr, int size){CUDA_SAFE_CALL(cudaMalloc(ptr, size)); return 0;}

Note : the solution proposed by jomoga seems to be faster if your matrix is not too large.
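
Rounding up only the grid while passing the real sizes to the kernel can only be safe if every thread checks its indices against the real width and height before touching memory. Here is a CPU sketch of that guard, assuming a 16x16 block and a separate output buffer. guarded_transpose and TILE are illustrative names, and the loops only emulate what the blocks and threads would do; this is not the SDK kernel itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int TILE = 16;  // assumed block edge, playing the role of BLOCK_DIM

// CPU emulation of a tiled transpose launched on a rounded-up grid.
// `in` has height rows of width elements; `out` gets width rows of height elements.
void guarded_transpose(std::vector<float>& out, const std::vector<float>& in,
                       int width, int height)
{
    assert(in.size() == std::size_t(width) * std::size_t(height));
    assert(out.size() == in.size());
    const int grid_x = (width  + TILE - 1) / TILE;  // blocks across
    const int grid_y = (height + TILE - 1) / TILE;  // blocks down

    for (int by = 0; by < grid_y; ++by)
    for (int bx = 0; bx < grid_x; ++bx)
    for (int ty = 0; ty < TILE; ++ty)
    for (int tx = 0; tx < TILE; ++tx) {
        const int x = bx * TILE + tx;  // column this "thread" handles
        const int y = by * TILE + ty;  // row this "thread" handles
        // Threads in the overhang of an edge block lie outside the
        // real matrix and must neither read nor write.
        if (x < width && y < height)
            out[x * height + y] = in[y * width + x];
    }
}
```

Without the bounds check, the edge blocks of a 100x200 matrix would read and write past the ends of the rows, which is one way to end up with stray values in the result.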