CUDA matrix transpose using shared memory

Hi,
I need to transpose a 39116 x 145 matrix using CUDA with shared memory. I have already done the task without shared memory:

    // One block per input column, one thread per input row.
    // The write to output is coalesced, but the read from input is
    // strided by gridDim.x (39116 floats), so it is uncoalesced.
    __global__ void transpose_GPU(float *input, float *output)
    {
        output[blockIdx.x * blockDim.x + threadIdx.x] = input[threadIdx.x * gridDim.x + blockIdx.x];
    }

    transpose_GPU<<<Len, Size>>>(inputMat, outputMat);
    // where Len is 39116 and Size is 145

I tried a simple shared-memory version, but it only made performance worse:

    __global__ void transpose_GPU_shared(float* input, float* output)
    {
        __shared__ float tile[145];

        // Same strided, uncoalesced read as the naive kernel.
        tile[threadIdx.x] = input[threadIdx.x * 39116 + blockIdx.x];
        __syncthreads();

        // Each thread writes back the element it just loaded, so the
        // tile is a pass-through that only adds synchronization overhead.
        output[blockIdx.x * 145 + threadIdx.x] = tile[threadIdx.x];
    }

    transpose_GPU_shared<<<39116, 145>>>(inputMat, outputMat);

I think I need a different shared-memory layout, and probably a different grid and block configuration too, but the version above is the only one I have managed to get working. I have read that the tile should be [32][32], padded to [32][33] to avoid bank conflicts, but I don't know how to apply that when neither 39116 nor 145 is a multiple of 32. My rough attempt is in the sketch below.
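Here is what I have pieced together from the classic tiled-transpose pattern in NVIDIA's examples. The kernel name, TILE_DIM, the rows/cols parameters, and the boundary guards are my own guesses; it assumes the input is stored row-major as 145 rows x 39116 columns, matching my kernels above:

    #define TILE_DIM 32

    // Tiled transpose sketch: input is rows x cols (row-major),
    // output is cols x rows. The guards handle dimensions that are
    // not multiples of TILE_DIM.
    __global__ void transpose_GPU_tiled(const float* input, float* output,
                                        int rows, int cols)
    {
        // Extra column of padding so the transposed read from the
        // tile does not cause shared-memory bank conflicts.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;  // column in input
        int y = blockIdx.y * TILE_DIM + threadIdx.y;  // row in input

        // Coalesced read: consecutive threads touch consecutive addresses.
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = input[y * cols + x];

        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;      // column in output
        y = blockIdx.x * TILE_DIM + threadIdx.y;      // row in output

        // Coalesced write of the transposed tile.
        if (x < rows && y < cols)
            output[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }

    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((39116 + TILE_DIM - 1) / TILE_DIM,   // tiles across the 39116 columns
              (145   + TILE_DIM - 1) / TILE_DIM);  // tiles across the 145 rows
    transpose_GPU_tiled<<<grid, block>>>(inputMat, outputMat, 145, 39116);

If I understand correctly, this way both the global read and the global write are coalesced, which is the point of staging through shared memory. Is the boundary handling right, or is there a better block/grid choice for such a skewed matrix?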

Thanks for any help.