Question about tranpose

D1mmu · May 14, 2008, 6:08pm

Hi all,

I’m trying to use the transpose function in the SDK.

When I use this function to transpose a large matrix (eg : 4000x4000) , I have bugs : some elements are moved in wrong positions.

The function used :

__global__ void transpose(float *odata, float *idata, int width, int height)

{

	__shared__ float block[BLOCK_DIM][BLOCK_DIM];

	// read the matrix tile into shared memory

	unsigned int xIndex = blockIdx.x * blockDim.x Â + threadIdx.x;

	unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

	if((xIndex < width) && (yIndex < height))

	{

 Â unsigned int index_in = yIndex * width + xIndex;

 Â block[threadIdx.y][threadIdx.x] = idata[index_in];

	}

	__syncthreads();

	// write the transposed matrix tile to global memory

	xIndex = blockIdx.y * blockDim.y + threadIdx.x;

	yIndex = blockIdx.x * blockDim.x + threadIdx.y;

	if((xIndex < height) && (yIndex < width))

	{

 Â unsigned int index_out = yIndex * height + xIndex;

 Â odata[index_out] = block[threadIdx.x][threadIdx.y];

	}

}

And then what I do to use this function with all kings of matrixes :

Â  Â  // ptr_learnSet[nb_attributes][nb_learnSet]

 Â deviceMalloc((void**) &d_data, memSize);

 Â hostToDevice(d_data, ptr_learnSet, memSize);

// Use a matrix where size %16 == 0

 Â  Â unsigned int size_x = nb_attributes + (BLOCK_DIM-(nb_attributes%BLOCK_DIM));

 Â unsigned int size_y = nb_learnSet + (BLOCK_DIM-(nb_learnSet%BLOCK_DIM));

 Â dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);

 Â dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

transpose<<< grid, threads >>>(d_data, d_data, nb_attributes, nb_learnSet);

deviceToHost(ptr_learnSetByRecords, d_data, memSize);

If my 2D grid size is greater than 7, some element start to be moved at wrong positions…

Per example, with a matrix 97x2, I have one misplaced element (the 191th element)…

Is my understanding of Grids and Blocks correct?!

Would you have an idea to fix this bug? Have I forgotten something about the manipulation of a grid?

I suppose that you all think that my code is very simple, but I’m a newbie in CUDA and it’s very hard for me to work on it.

Thank you for your help,

D1mmu.

mhmerrill · May 15, 2008, 2:05am

I think your error is in the host code, you need to also have a partial block at the end of your rows and columns since your matrix is not a multiple of 16x16.

 dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);

should be… because of integer division…

 dim3 grid((size_x+BLOCK_DIM-1) / BLOCK_DIM, (size_y+BLOCK_DIM-1) / BLOCK_DIM, 1);

This will bring things up to the next multiple of 16 in both dimensions.

Hope this helps,

Mike

D1mmu · May 15, 2008, 10:27am

I think your error is in the host code, you need to also have a partial block at the end of your rows and columns since your matrix is not a multiple of 16x16.
 dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
should be… because of integer division…
 dim3 grid((size_x+BLOCK_DIM-1) / BLOCK_DIM, (size_y+BLOCK_DIM-1) / BLOCK_DIM, 1);
This will bring things up to the next multiple of 16 in both dimensions.

Hope this helps,

Mike

[snapback]377270[/snapback]

Thank you for your help mhmerril! But size_x and size_y are ever multiple of 16 :

unsigned int size_x = nb_attributes + (BLOCK_DIM-(nb_attributes%BLOCK_DIM));

unsigned int size_y = nb_learnSet + (BLOCK_DIM-(nb_learnSet%BLOCK_DIM));

I could set these values as you said :

unsigned int size_x = nb_attributes + BLOCK_DIM-1;

unsigned int size_y = nb_learnSet + BLOCK_DIM-1;

But the result is the same…

The number of misplaced elements in the program change at each execution… How could I solve this problem? I think I don’t use the grid as I should…

redpill · May 15, 2008, 3:01pm

Remember that your kernel call is asynchronous: control returns to the CPU while the GPU is still cranking away. With small amounts of data, the GPU manages to finish before you copy the results back to the host. With larger amounts, it’s not yet done when you grab the results.

Put this right before your deviceToHost() call:

cudaThreadSynchronize();

Let us know if that was the problem.

D1mmu · May 15, 2008, 3:23pm

Remember that your kernel call is asynchronous: control returns to the CPU while the GPU is still cranking away. With small amounts of data, the GPU manages to finish before you copy the results back to the host. With larger amounts, it’s not yet done when you grab the results.

Put this right before your deviceToHost() call:
cudaThreadSynchronize();
Let us know if that was the problem.

[snapback]377503[/snapback]

I put :

CUDA_SAFE_CALL( cudaThreadSynchronize() );

before deviceToHost() and the problem is the same… I also put sleep(2); but it doesn’t change anything…

MisterAnderson42 · May 15, 2008, 3:44pm

adding cudaThreadSynchronize shouldn’t change anything. A cudaMemcpy (or any other cuda function that accesses a device pointer) will wait for the kernel to finish before it performs the copy. cudaThreadSynchronize is only needed for timing purposes.

I’ve never looked at the transpose sample code in detail before, so I’m not sure what may be causing the problem. There may be more a more restrictive limit on the matrix size than it being a multiple of 16. This post: [url=“http://forums.nvidia.com/index.php?showtopic=67179”]The Official NVIDIA Forums | NVIDIA seems to indicate that the matrices must have a size as a multiple of BLOCK_DIM.

Skribtsov · May 15, 2008, 3:46pm

Remember that your kernel call is asynchronous: control returns to the CPU while the GPU is still cranking away. With small amounts of data, the GPU manages to finish before you copy the results back to the host. With larger amounts, it’s not yet done when you grab the results.

Put this right before your deviceToHost() call:
cudaThreadSynchronize();
Let us know if that was the problem.

[snapback]377503[/snapback]

In fact, I’ve never seen in the CUDA docs that one should call cudaThreadsSynch before trying to access the result. My kernel launch release control to CPU immediately, but memory copies work correctly, which is the next line to it. I suspect CUDA has some kind of protection from that similar to the OpenGL principle - if you read output buffer with glRead it waits until shader completes. I think we need to hear NVIDIA guys comments here ;-)

D1mmu · May 15, 2008, 4:11pm

yoavmor on this post http://forums.nvidia.com/index.php?showtopic=67179 is using my code to perform tranpose on very large matrixes and he contacts me because he has found this bug… So these 2 topics are the same problem,I’m trying to help him.

seibert · May 15, 2008, 5:08pm

Section 4.5.1.5 of the CUDA programming guide:

"Any kernel launch, memory set, or memory copy for which a zero stream parameter

has been specified begins only after all preceding operations are done, including

operations that are part of other streams, and no subsequent operation may begin

until it is done."

Skribtsov · May 15, 2008, 5:23pm

so it follows - no cudaThreadSynchronize needed, right?

paulius · May 15, 2008, 9:05pm

Yes

D1mmu · May 16, 2008, 1:36pm

And so what could be wrong if it’s not a synchronization problem?

yoavmor · May 18, 2008, 7:09am

D1mmu, Does this help you in any way?

D1mmu · May 18, 2008, 8:02pm

No sorry because the solution given by Skribtsov :

is exactly what I do… And it does not work for a large grid (I don’t know why <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=‘:’(’ />)… And you did you fix that problem?

yoavmor · May 19, 2008, 7:40am

No… <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=‘:’(’ />
This is so strange.
Maybe someone from NVIDIA can lend us a hand here…? :(
I mean, I don’t think it’s a bug with CUDA, it’s more likely something that we’re doing wrong, I just have no idea what.

Skribtsov · May 19, 2008, 2:19pm

send me (or here) your code

D1mmu · May 20, 2008, 3:34pm

Here is the code… The program doesn’t return errors each time, so if you start it many times, you will see that the misplaced elements change at each execution…

Thank you so much for your help Skribtsov! External Media
largetranspose.tar.gz (2.35 KB)

Skribtsov · May 21, 2008, 4:14am

I haven’t actually debugged or runned the code, but there is one place to check which is visible without run.

in your kernel, if you have read a value from the source matrix YOU MUST write it somewhere back.

So, you must remove the second condition from the shader’s code and embrace by the first condition the whole kernel.

In addition, your second condition looks suspicious. Are you sure you did not make a mistate in .x- and .y- s?

xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
if((xIndex < height) && (yIndex < width))

D1mmu · June 11, 2008, 6:03pm

I haven’t actually debugged or runned the code, but there is one place to check which is visible without run.

in your kernel, if you have read a value from the source matrix YOU MUST write it somewhere back.

So, you must remove the second condition from the shader’s code and embrace by the first condition the whole kernel.

In addition, your second condition looks suspicious. Are you sure you did not make a mistate in .x- and .y- s?
xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;

yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;

if((xIndex < height) && (yIndex < width))
[snapback]380403[/snapback]

I don’t think so… It’s exactly the code written in the SDK, I just made a main program to use it with more elements (sorry for my reply so late, I had some exams).

kristleifur · June 11, 2008, 6:32pm

In addition, your second condition looks suspicious. Are you sure you did not make a mistate in .x- and .y- s?
xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;

yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;

if((xIndex < height) && (yIndex < width))
[snapback]380403[/snapback]

That’s correct I think - and yes, it also does look suspicious.

First you flip the blocks, then you flip the threads inside the block :)

Topic		Replies	Views
about __syncthreads() in SDK/project/transpose CUDA Programming and Performance	5	2723	September 18, 2009
Efficient in-place transpose of multiple square float matrices CUDA Programming and Performance	10	5380	October 10, 2019
Transpose matrix like 8x1M in bytes by memcpy2d CUDA Programming and Performance cuda	10	45	November 13, 2024
Transpose example, strange dim dependent lagg.. CUDA Programming and Performance	24	12237	October 25, 2009
GPU Transfer problems GPU won't correctly read data out from Device to Host CUDA Programming and Performance	15	2633	August 2, 2010
Distance Transform CUDA problem CUDA Programming and Performance	10	6792	May 25, 2023
Urgent help with threads please! CUDA Programming and Performance	21	10786	March 6, 2008
An Efficient Matrix Transpose in CUDA C/C++ Technical Blog	31	2502	October 30, 2020
CUDA Memory Transpose CUDA Programming and Performance	10	1814	March 23, 2015
Problems of matrix multiplication With and without CUDA CUDA Programming and Performance	15	9997	January 18, 2012

Question about tranpose

Related topics