rectangular matrix transpose

WhitAngl · September 11, 2007, 9:14am

Hi,

I’m trying to use the matrix transpose code sample from the CUDA SDK.
It works well for square matrices and for matrices whose dimensions are powers of 2.
I now have two problems:

1. For matrices smaller than 16x16, it doesn’t work because no threads are created. I worked around that by using the naive version for matrices smaller than 16x16 and by launching it with:
dim3 grid((size_x+15) / 16, (size_y+15) / 16, 1);
2. For fully arbitrary matrices (i.e. non-power-of-2, non-square), I get strange results (I still launch with the same dim3 grid as above): several entries of the result are 0 when they should not be.

Is there a solution?

Thank you very much in advance,

Nicolas
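A note on the grid expression in the question: rounding the block count up means the launch covers more threads than the matrix has elements. The sketch below only does that arithmetic on the host. The names blocksFor and surplusThreads are hypothetical helpers, and BLOCK_DIM 16 is assumed from the 16x16 blocks discussed in this thread. If the kernel has no bounds check, those surplus threads are one plausible source of wrong or missing values, though the thread does not confirm the actual cause.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 16  // assumed tile width (16x16 thread blocks)

// Number of BLOCK_DIM-wide blocks needed to cover n elements (ceiling division).
inline unsigned int blocksFor(unsigned int n)
{
    return (n + BLOCK_DIM - 1) / BLOCK_DIM;
}

// Threads launched outside the bounds of a size_x by size_y matrix
// when the grid is rounded up as in the question.
inline unsigned int surplusThreads(unsigned int size_x, unsigned int size_y)
{
    dim3 grid(blocksFor(size_x), blocksFor(size_y), 1);
    unsigned int launched = (grid.x * BLOCK_DIM) * (grid.y * BLOCK_DIM);
    return launched - size_x * size_y;
}
```

For a 100x200 matrix this gives a 7x13 grid, so 112 x 208 threads are launched for 100 x 200 elements. A 5x5 matrix still gets one block, which is why the rounded-up grid fixes the "no threads" case.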

jomoga · September 12, 2007, 2:53pm

You have to pad your dimensions to multiples of 16. For example, a 100x200 matrix should be embedded in a 112x208 matrix, transposed, and then extracted.

The code should look something like this (Note: I haven’t checked for errors!):

// Unpadded device arrays (100 floats per row x 200 rows in, 200 x 100 out)
float *inarr, *outarr;
CUDA_SAFE_CALL(cudaMalloc((void**) &inarr, 100*200*4));
CUDA_SAFE_CALL(cudaMalloc((void**) &outarr, 200*100*4));

… Load input array …

// Padded device arrays (112 x 208 in, 208 x 112 out)
float *inarr_pad, *outarr_pad;
CUDA_SAFE_CALL(cudaMalloc((void**) &inarr_pad, 112*208*4));
CUDA_SAFE_CALL(cudaMalloc((void**) &outarr_pad, 208*112*4));

// Clear padded arrays
cudaMemset(inarr_pad, 0, 112*208*4);
cudaMemset(outarr_pad, 0, 112*208*4);

// Load padded input array, one 100-float row at a time
for (int i=0; i<200; i++)
    CUDA_SAFE_CALL(cudaMemcpy(&inarr_pad[i*112], &inarr[i*100], 100*4,
                              cudaMemcpyDeviceToDevice));

// Perform transpose
dim3 grid_tran(7, 13, 1);
dim3 threads_tran(16, 16, 1);
transpose<<< grid_tran, threads_tran >>>(outarr_pad, inarr_pad, 112, 208);

// Extract transposed array, one 200-float row at a time
for (int i=0; i<100; i++)
    CUDA_SAFE_CALL(cudaMemcpy(&outarr[i*200], &outarr_pad[i*208], 200*4,
                              cudaMemcpyDeviceToDevice));

Joel
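The row-by-row cudaMemcpy loops above can also be expressed as a single strided copy with cudaMemcpy2D, which takes separate row pitches for source and destination. What follows is a minimal sketch, not something tested against the SDK sample: roundUp, copyIntoPadded and extractFromPadded are hypothetical helper names, and BLOCK_DIM 16 is assumed from the 16x16 blocks used in this thread.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 16  // assumed tile width

// Round n up to the next multiple of BLOCK_DIM (unchanged if already a multiple).
inline size_t roundUp(size_t n)
{
    return (n + BLOCK_DIM - 1) / BLOCK_DIM * BLOCK_DIM;
}

// Zero a padWidth x padHeight device buffer, then copy a width x height
// device matrix into its top-left corner.
inline cudaError_t copyIntoPadded(float *dst_pad, size_t padWidth, size_t padHeight,
                                  const float *src, size_t width, size_t height)
{
    cudaError_t err = cudaMemset(dst_pad, 0, padWidth * padHeight * sizeof(float));
    if (err != cudaSuccess) return err;
    return cudaMemcpy2D(dst_pad, padWidth * sizeof(float),  // destination, row pitch in bytes
                        src, width * sizeof(float),         // source, row pitch in bytes
                        width * sizeof(float), height,      // bytes per row, number of rows
                        cudaMemcpyDeviceToDevice);
}

// Inverse: copy the top-left width x height corner of a padded buffer
// into a tightly packed device matrix.
inline cudaError_t extractFromPadded(float *dst, const float *src_pad, size_t padWidth,
                                     size_t width, size_t height)
{
    return cudaMemcpy2D(dst, width * sizeof(float),
                        src_pad, padWidth * sizeof(float),
                        width * sizeof(float), height,
                        cudaMemcpyDeviceToDevice);
}
```

With the sizes from the post, roundUp(100) and roundUp(200) give 112 and 208. The load step would be copyIntoPadded(inarr_pad, 112, 208, inarr, 100, 200), and after the transpose the extraction would be extractFromPadded(outarr, outarr_pad, 208, 200, 100).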

WhitAngl · September 12, 2007, 4:34pm

Thank you very much for your answer, but is there a way to avoid padding? I am implementing a general Matrix class for linear algebra with as few readbacks as possible. Padding my matrices only for the transpose function would cause lots of problems for the other functions, and I don’t want to get the CPU involved in this function.

Thanks !
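One common way to avoid padding is to round the grid up and let each thread check its indices against the real matrix size before it reads or writes. The kernel below is a sketch of that idea, written from scratch rather than copied from the SDK. transposeGuarded and launchTransposeGuarded are hypothetical names, BLOCK_DIM 16 is assumed, and the input and output buffers are assumed to be distinct.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 16  // assumed tile width

// Transposes a row-major matrix with `width` elements per row and `height` rows
// into a row-major matrix with `height` elements per row and `width` rows.
// odata and idata must not overlap.
__global__ void transposeGuarded(float *odata, const float *idata, int width, int height)
{
    // The extra column keeps column-wise tile reads from hitting the same shared-memory bank.
    __shared__ float tile[BLOCK_DIM][BLOCK_DIM + 1];

    // Read one tile; threads that fall outside the matrix skip the load.
    int x = blockIdx.x * BLOCK_DIM + threadIdx.x;
    int y = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();  // outside the guards, so every thread in the block reaches it

    // Write the tile transposed; block coordinates swap roles.
    x = blockIdx.y * BLOCK_DIM + threadIdx.x;
    y = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Launch with a grid rounded up to cover partial tiles at the right and bottom edges.
inline void launchTransposeGuarded(float *d_out, const float *d_in, int width, int height)
{
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 grid((width + BLOCK_DIM - 1) / BLOCK_DIM,
              (height + BLOCK_DIM - 1) / BLOCK_DIM, 1);
    transposeGuarded<<<grid, threads>>>(d_out, d_in, width, height);
}
```

Because the guards use the real width and height, the matrices keep their unpadded layout everywhere else in the Matrix class, and inputs smaller than 16x16 still get one block.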

D1mmu · April 30, 2008, 1:17pm

I know that my reply is late, but I had the same problem and found another solution. I’m posting it in case it helps someone in the future:

// h_data ==> matrix [realSizeX][realSizeY]
uint memSize = sizeof(float) * realSizeX * realSizeY;
float *d_in, *d_out;

// Round each dimension up to a multiple of BLOCK_DIM so the grid covers the whole matrix
// (a dimension that is already a multiple is left unchanged)
uint size_x = ((realSizeX + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM;
uint size_y = ((realSizeY + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM;

deviceMalloc((void**) &d_in, memSize);
deviceMalloc((void**) &d_out, memSize);  // separate output: blocks would race if the kernel ran in place
hostToDevice(d_in, h_data, memSize);

dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

// Transpose function in the SDK
transpose<<< grid, threads >>>(d_out, d_in, realSizeX, realSizeY);

deviceToHost(h_data, d_out, memSize);
deviceFree(d_in);
deviceFree(d_out);

// Functions (each returns the CUDA error code; 0 means cudaSuccess):
int hostToDevice(void *to, void *from, int size){return (int) cudaMemcpy(to, from, size, cudaMemcpyHostToDevice);}
int deviceToHost(void *to, void *from, int size){return (int) cudaMemcpy(to, from, size, cudaMemcpyDeviceToHost);}
int deviceFree(void *ptr){return (int) cudaFree(ptr);}
int deviceMalloc(void** ptr, int size){return (int) cudaMalloc(ptr, size);}

Note: the solution proposed by jomoga seems to be faster if your matrix is not too large.
