Hi all,
I’m trying to use the transpose function from the CUDA SDK.
When I use it to transpose a large matrix (e.g. 4000x4000), I get wrong results: some elements end up in the wrong positions.
The function I use:
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory;
    // the block indices are swapped so that both the read above
    // and this write are coalesced
    xIndex = blockIdx.y * blockDim.y + threadIdx.x;
    yIndex = blockIdx.x * blockDim.x + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}
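(BLOCK_DIM here is the tile size from the SDK sample; in my code it is

#define BLOCK_DIM 16

which is why I pad my sizes to multiples of 16 below.)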
And here is how I call this function on matrices of arbitrary size:
    // ptr_learnSet[nb_attributes][nb_learnSet]
    deviceMalloc((void**) &d_data, memSize);
    hostToDevice(d_data, ptr_learnSet, memSize);

    // round each dimension up so that size % BLOCK_DIM == 0 (BLOCK_DIM = 16)
    unsigned int size_x = nb_attributes + (BLOCK_DIM - (nb_attributes % BLOCK_DIM));
    unsigned int size_y = nb_learnSet + (BLOCK_DIM - (nb_learnSet % BLOCK_DIM));
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

    // note: d_data is passed as both the output and the input
    transpose<<< grid, threads >>>(d_data, d_data, nb_attributes, nb_learnSet);

    deviceToHost(ptr_learnSetByRecords, d_data, memSize);
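(For completeness: deviceMalloc, hostToDevice and deviceToHost are just my thin wrappers around the CUDA runtime, roughly like below, with error checking omitted, and memSize is nb_attributes * nb_learnSet * sizeof(float).)

void deviceMalloc(void **devPtr, size_t size)
{
    cudaMalloc(devPtr, size);                               // allocate device memory
}

void hostToDevice(void *dst, const void *src, size_t size)
{
    cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);     // copy host -> device
}

void deviceToHost(void *dst, const void *src, size_t size)
{
    cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost);     // copy device -> host
}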
If my 2D grid size is greater than 7, some elements start to be moved to wrong positions…
For example, with a 97x2 matrix, I have one misplaced element (the 191st)…
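To spot the misplaced elements I compare the GPU result against a plain CPU transpose, roughly like this (host-side reference code, not the kernel):

// host-side reference: in is a height x width row-major matrix,
// out receives the width x height transpose
void transposeCPU(float *out, const float *in, int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[x * height + y] = in[y * width + x];
}

(For the 97x2 case, assuming nb_attributes = 97 and nb_learnSet = 2, the padding gives size_x = 97 + (16 - 97 % 16) = 112 and size_y = 2 + (16 - 2 % 16) = 16, so the grid is 7x1 blocks of 16x16 threads.)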
Is my understanding of grids and blocks correct?
Do you have an idea how to fix this bug? Have I forgotten something about how grids are handled?
I suppose this code looks very simple to you all, but I’m a CUDA newbie and I’m finding it hard to debug.
Thank you for your help,
D1mmu.