CUDA matrix transpose using shared memory

Hi,
I need to transpose a 39116 x 145 matrix using CUDA with shared memory. I have already done the task without shared memory:

    // One block per input column, one thread per input row.
    // The write to output is coalesced, but the read from input is
    // strided by gridDim.x (39116 floats), so it is uncoalesced.
    __global__ void transpose_GPU(float *input, float *output)
    {
        output[blockIdx.x * blockDim.x + threadIdx.x] = input[threadIdx.x * gridDim.x + blockIdx.x];
    }

    transpose_GPU<<<Len, Size>>>(inputMat, outputMat);
    // where Len is 39116 and Size is 145

I tried a simple shared-memory version, but it only made performance worse:

    __global__ void transpose_GPU_shared(float* input, float* output)
    {
        __shared__ float tile[145];

        // Same strided, uncoalesced read as the naive kernel.
        tile[threadIdx.x] = input[threadIdx.x * 39116 + blockIdx.x];
        __syncthreads();

        // Each thread writes back the element it just loaded, so the
        // tile is a pass-through that only adds synchronization overhead.
        output[blockIdx.x * 145 + threadIdx.x] = tile[threadIdx.x];
    }

    transpose_GPU_shared<<<39116, 145>>>(inputMat, outputMat);

I think I need a different shared-memory layout, and probably a different grid and block configuration too, but the version above is the only one I have managed to get working. I have read that the tile should be [32][32], padded to [32][33] to avoid bank conflicts, but I don't know how to apply that when neither 39116 nor 145 is a multiple of 32. My rough attempt is in the sketch below.
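Here is what I have pieced together from the classic tiled-transpose pattern in NVIDIA's examples. The kernel name, TILE_DIM, the rows/cols parameters, and the boundary guards are my own guesses; it assumes the input is stored row-major as 145 rows x 39116 columns, matching my kernels above:

    #define TILE_DIM 32

    // Tiled transpose sketch: input is rows x cols (row-major),
    // output is cols x rows. The guards handle dimensions that are
    // not multiples of TILE_DIM.
    __global__ void transpose_GPU_tiled(const float* input, float* output,
                                        int rows, int cols)
    {
        // Extra column of padding so the transposed read from the
        // tile does not cause shared-memory bank conflicts.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;  // column in input
        int y = blockIdx.y * TILE_DIM + threadIdx.y;  // row in input

        // Coalesced read: consecutive threads touch consecutive addresses.
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = input[y * cols + x];

        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;      // column in output
        y = blockIdx.x * TILE_DIM + threadIdx.y;      // row in output

        // Coalesced write of the transposed tile.
        if (x < rows && y < cols)
            output[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }

    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((39116 + TILE_DIM - 1) / TILE_DIM,   // tiles across the 39116 columns
              (145   + TILE_DIM - 1) / TILE_DIM);  // tiles across the 145 rows
    transpose_GPU_tiled<<<grid, block>>>(inputMat, outputMat, 145, 39116);

If I understand correctly, this way both the global read and the global write are coalesced, which is the point of staging through shared memory. Is the boundary handling right, or is there a better block/grid choice for such a skewed matrix?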

Thanks for any help.