Hi,
I need to transpose a 39116 x 145 matrix in CUDA using shared memory. I already did the task without shared memory:
__global__ void transpose_GPU(float *input, float *output)
{
    output[blockIdx.x * blockDim.x + threadIdx.x] = input[threadIdx.x * gridDim.x + blockIdx.x];
}

// Len is 39116 and Size is 145
transpose_GPU<<<Len, Size>>>(inputMat, outputMat);
I tried a simple shared-memory version, but performance only got worse:
__global__ void transpose_GPU_shared(float *input, float *output)
{
    __shared__ float tile[145];

    tile[threadIdx.x] = input[threadIdx.x * 39116 + blockIdx.x];
    __syncthreads();
    output[blockIdx.x * 145 + threadIdx.x] = tile[threadIdx.x];
}

transpose_GPU_shared<<<39116, 145>>>(inputMat, outputMat);
I think what I need is a different use of shared memory, and probably different grid and block sizes too. Unfortunately, the version above is the only one I've gotten to work. I've heard the tile should be [32] or [32][32] ([32][32+1] to avoid bank conflicts); I just don't know how to apply that with the matrix dimensions I have.
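In case it helps show where I'm stuck, this is the kind of tiled kernel I believe I'm being pointed toward (an untested sketch adapted from the classic 32x32 tile pattern; the kernel name, TILE_DIM/BLOCK_ROWS values, and the bounds checks for my non-multiple-of-32 dimensions are my own guesses):

```cuda
#include <cuda_runtime.h>

#define TILE_DIM 32
#define BLOCK_ROWS 8  // each thread copies TILE_DIM / BLOCK_ROWS elements

// Transpose a rows x cols row-major matrix. Dimensions need not be
// multiples of TILE_DIM; out-of-range threads simply skip their element.
__global__ void transpose_tiled(const float *in, float *out, int rows, int cols)
{
    // +1 column of padding so the transposed reads below hit
    // different shared-memory banks (avoids bank conflicts)
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;  // column in input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;  // row in input

    // coalesced read of a 32x32 tile from the input
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < cols && (y + j) < rows)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * cols + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // column in output
    y = blockIdx.x * TILE_DIM + threadIdx.y;  // row in output

    // coalesced write of the transposed tile to the output
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < rows && (y + j) < cols)
            out[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Launch for my 39116 x 145 matrix (grid rounded up to cover the edges):
// dim3 block(TILE_DIM, BLOCK_ROWS);
// dim3 grid((145 + TILE_DIM - 1) / TILE_DIM, (39116 + TILE_DIM - 1) / TILE_DIM);
// transpose_tiled<<<grid, block>>>(inputMat, outputMat, 39116, 145);
```

Is this roughly the right shape, or am I handling the ragged edges (145 and 39116 aren't multiples of 32) the wrong way?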
Thanks for the help.