Question about tranpose

Hi all,

I’m trying to use the transpose function in the SDK.

When I use this function to transpose a large matrix (eg : 4000x4000) , I have bugs : some elements are moved in wrong positions.

The function used :

__global__ void transpose(float *odata, float *idata, int width, int height)


	__shared__ float block[BLOCK_DIM][BLOCK_DIM];

	// read the matrix tile into shared memory

	unsigned int xIndex = blockIdx.x * blockDim.x  + threadIdx.x;

	unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

	if((xIndex < width) && (yIndex < height))


 Â unsigned int index_in = yIndex * width + xIndex;

 Â block[threadIdx.y][threadIdx.x] = idata[index_in];



	// write the transposed matrix tile to global memory

	xIndex = blockIdx.y * blockDim.y + threadIdx.x;

	yIndex = blockIdx.x * blockDim.x + threadIdx.y;

	if((xIndex < height) && (yIndex < width))


 Â unsigned int index_out = yIndex * height + xIndex;

 Â odata[index_out] = block[threadIdx.x][threadIdx.y];



And then what I do to use this function with all kings of matrixes :

  Â  // ptr_learnSet[nb_attributes][nb_learnSet]

 Â deviceMalloc((void**) &d_data, memSize);

 Â hostToDevice(d_data, ptr_learnSet, memSize);

// Use a matrix where size %16 == 0

 Â  Â unsigned int size_x = nb_attributes + (BLOCK_DIM-(nb_attributes%BLOCK_DIM));

 Â unsigned int size_y = nb_learnSet + (BLOCK_DIM-(nb_learnSet%BLOCK_DIM));

 Â dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);

 Â dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

transpose<<< grid, threads >>>(d_data, d_data, nb_attributes, nb_learnSet);

deviceToHost(ptr_learnSetByRecords, d_data, memSize);

If my 2D grid size is greater than 7, some element start to be moved at wrong positions…

Per example, with a matrix 97x2, I have one misplaced element (the 191th element)…

Is my understanding of Grids and Blocks correct?!

Would you have an idea to fix this bug? Have I forgotten something about the manipulation of a grid?

I suppose that you all think that my code is very simple, but I’m a newbie in CUDA and it’s very hard for me to work on it.

Thank you for your help,


I think your error is in the host code, you need to also have a partial block at the end of your rows and columns since your matrix is not a multiple of 16x16.

 dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);

should be… because of integer division…

 dim3 grid((size_x+BLOCK_DIM-1) / BLOCK_DIM, (size_y+BLOCK_DIM-1) / BLOCK_DIM, 1);

This will bring things up to the next multiple of 16 in both dimensions.

Hope this helps,


Thank you for your help mhmerril! But size_x and size_y are ever multiple of 16 :

unsigned int size_x = nb_attributes + (BLOCK_DIM-(nb_attributes%BLOCK_DIM));

unsigned int size_y = nb_learnSet + (BLOCK_DIM-(nb_learnSet%BLOCK_DIM));

I could set these values as you said :

unsigned int size_x = nb_attributes + BLOCK_DIM-1;

unsigned int size_y = nb_learnSet + BLOCK_DIM-1;

But the result is the same…

The number of misplaced elements in the program change at each execution… How could I solve this problem? I think I don’t use the grid as I should…

Remember that your kernel call is asynchronous: control returns to the CPU while the GPU is still cranking away. With small amounts of data, the GPU manages to finish before you copy the results back to the host. With larger amounts, it’s not yet done when you grab the results.

Put this right before your deviceToHost() call:


Let us know if that was the problem.

I put :

CUDA_SAFE_CALL( cudaThreadSynchronize() );

before deviceToHost() and the problem is the same… I also put sleep(2); but it doesn’t change anything…

adding cudaThreadSynchronize shouldn’t change anything. A cudaMemcpy (or any other cuda function that accesses a device pointer) will wait for the kernel to finish before it performs the copy. cudaThreadSynchronize is only needed for timing purposes.

I’ve never looked at the transpose sample code in detail before, so I’m not sure what may be causing the problem. There may be more a more restrictive limit on the matrix size than it being a multiple of 16. This post: seems to indicate that the matrices must have a size as a multiple of BLOCK_DIM.

In fact, I’ve never seen in the CUDA docs that one should call cudaThreadsSynch before trying to access the result. My kernel launch release control to CPU immediately, but memory copies work correctly, which is the next line to it. I suspect CUDA has some kind of protection from that similar to the OpenGL principle - if you read output buffer with glRead it waits until shader completes. I think we need to hear NVIDIA guys comments here ;-)

yoavmor on this post is using my code to perform tranpose on very large matrixes and he contacts me because he has found this bug… So these 2 topics are the same problem,I’m trying to help him.

Section of the CUDA programming guide:

"Any kernel launch, memory set, or memory copy for which a zero stream parameter

has been specified begins only after all preceding operations are done, including

operations that are part of other streams, and no subsequent operation may begin

until it is done."

so it follows - no cudaThreadSynchronize needed, right?


And so what could be wrong if it’s not a synchronization problem?

D1mmu, Does this help you in any way?

No sorry because the solution given by Skribtsov :

is exactly what I do… And it does not work for a large grid (I don’t know why <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />)… And you did you fix that problem?

No… <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />
This is so strange.
Maybe someone from NVIDIA can lend us a hand here…? :(
I mean, I don’t think it’s a bug with CUDA, it’s more likely something that we’re doing wrong, I just have no idea what.

send me (or here) your code

Here is the code… The program doesn’t return errors each time, so if you start it many times, you will see that the misplaced elements change at each execution…

Thank you so much for your help Skribtsov! :thumbup:
largetranspose.tar.gz (2.35 KB)

I haven’t actually debugged or runned the code, but there is one place to check which is visible without run.

in your kernel, if you have read a value from the source matrix YOU MUST write it somewhere back.

So, you must remove the second condition from the shader’s code and embrace by the first condition the whole kernel.

In addition, your second condition looks suspicious. Are you sure you did not make a mistate in .x- and .y- s?

xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
if((xIndex < height) && (yIndex < width))

I don’t think so… It’s exactly the code written in the SDK, I just made a main program to use it with more elements (sorry for my reply so late, I had some exams).

That’s correct I think - and yes, it also does look suspicious.

First you flip the blocks, then you flip the threads inside the block :)