Problem while indexing with 2D threads and 2D blocks

Hello everyone.

I am a beginner in CUDA programming. I am using a GeForce GTX 650 GPU for image processing. I am experimenting by copying an image from the host to the GPU and back. The problem I am facing is in the indexing part of the code. I have shared a snippet of the code below.

***************************************** HOST *********************
int img_dimx=257;
int img_dimy=257;

// the image size is 257*257 = 66049 elements
int input[257][257];
int output[257][257];

dim3 threadID(32,32); // I want to use all 1024 threads in the block, in a 2D layout.

dim3 blockID( (img_dimx + (threadID.x -1))/threadID.x , (img_dimy + (threadID.y -1))/threadID.y);

// from the above line, the grid size is 9x9 blocks
// total threads is 32*32*9*9 = 82944 threads

***************************************** DEVICE *******************

int x = threadIdx.x + (blockIdx.x * blockDim.x);
int y = threadIdx.y + (blockIdx.y * blockDim.y);
int Gloc = x + (y * img_dimx); // ----------(1)
// int Gloc = x + (y * blockDim.x * gridDim.x); // ----------(2)

dev_out[Gloc] = dev_in[Gloc];


When I use Gloc = x + (y * img_dimx); ----------(1)

If I have x=0 and y=1, then I get Gloc as 257. So in the output image it will be stored at location (0,1), i.e. row 1, column 0.

If I have x=3 and y=2, then I get Gloc as 517. So in the output image it will be stored at location (3,2), i.e. row 2, column 3.

But when I use Gloc = x + (y * blockDim.x * gridDim.x); ----------(2)

If I have x=0 and y=1, then I get Gloc as 288 (the row stride is 32*9 = 288). In the 257-wide output image that is location (31,1), i.e. row 1, column 31, instead of row 1, column 0.

If I have x=3 and y=2, then I get Gloc as 579. In the output image that is location (65,2), i.e. row 2, column 65, instead of row 2, column 3.
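
The arithmetic for the two formulas can be checked on the host with a small sketch like the one below. The helper names gloc_eq1 and gloc_eq2 are made up for illustration; the sizes (257-wide image, 32 threads per block in x) are taken from the snippet above. It prints each Gloc together with the (column, row) it corresponds to in a 257-wide row-major image.

```cuda
#include <cstdio>

// Equation (1): the row stride is the image width.
// (Hypothetical helper name, not part of the original code.)
__host__ __device__ inline int gloc_eq1(int x, int y, int img_dimx)
{
    return x + y * img_dimx;
}

// Equation (2): the row stride is the grid width in threads
// (blockDim.x * gridDim.x), which is larger than the image width
// whenever the width is not a multiple of the block width.
__host__ __device__ inline int gloc_eq2(int x, int y, int block_dim_x, int grid_dim_x)
{
    return x + y * block_dim_x * grid_dim_x;
}

int main()
{
    const int img_dimx    = 257;
    const int block_dim_x = 32;
    const int grid_dim_x  = (img_dimx + block_dim_x - 1) / block_dim_x;  // 9 blocks

    const int xs[] = {0, 3};
    const int ys[] = {1, 2};

    for (int i = 0; i < 2; ++i) {
        const int g1 = gloc_eq1(xs[i], ys[i], img_dimx);
        const int g2 = gloc_eq2(xs[i], ys[i], block_dim_x, grid_dim_x);
        // Convert each flat index back to (column, row) in the 257-wide image.
        printf("x=%d y=%d  eq1: Gloc=%d -> (col %d, row %d)  eq2: Gloc=%d -> (col %d, row %d)\n",
               xs[i], ys[i],
               g1, g1 % img_dimx, g1 / img_dimx,
               g2, g2 % img_dimx, g2 / img_dimx);
    }
    return 0;
}
```

With eq 2 the largest index any thread can produce is 287 + 287*288 = 82943, which is past the end of a 66049-element buffer.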


The problem is that when I use eq 1 the CUDA program hangs, and when I use eq 2 it works correctly. Why is that? Shouldn't it be the other way around?

Also, eq 2 doesn't hold if the image size is not a multiple of 32 or 16.

Why is this happening? Is there a fixed way to get the correct indexing whether the image size is odd or even?

Sincerely, Demonferrari

If 82,944 threads are started and only 66,049 should actually be working on memory, you MUST have an if statement in your kernel code or else it will behave unpredictably. Use an if statement to guard dev_out[Gloc] = dev_in[Gloc];, e.g. “if (Gloc < 66049)” or whatever you find appropriate. Without the if statement, threads with a Gloc of 66049 and above will attempt to access memory that isn’t there.
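
Here is a minimal sketch of what such a guarded kernel could look like, assuming the image is stored row-major in a flat buffer of 257*257 ints. The names copy_image and run_copy, and the use of std::vector on the host, are my own choices for illustration, not taken from the original code. The guard tests x and y against the image dimensions instead of testing Gloc alone.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Copies a width x height image stored row-major in a flat array.
__global__ void copy_image(const int* dev_in, int* dev_out, int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row

    // The grid is rounded up to whole blocks, so some threads fall
    // outside the image. Those threads must not touch memory.
    if (x < width && y < height) {
        int Gloc = x + y * width;  // eq 1: row stride is the image width
        dev_out[Gloc] = dev_in[Gloc];
    }
}

// Hypothetical helper: copies a test image through the GPU and reports
// whether the result matches the input.
bool run_copy(int img_dimx, int img_dimy)
{
    const size_t count = static_cast<size_t>(img_dimx) * img_dimy;
    const size_t bytes = count * sizeof(int);

    std::vector<int> input(count), output(count, -1);
    for (size_t i = 0; i < count; ++i) input[i] = static_cast<int>(i);

    int* dev_in = nullptr;
    int* dev_out = nullptr;
    if (cudaMalloc(&dev_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) { cudaFree(dev_in); return false; }
    cudaMemcpy(dev_in, input.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(32, 32);
    dim3 blocks((img_dimx + threads.x - 1) / threads.x,
                (img_dimy + threads.y - 1) / threads.y);
    copy_image<<<blocks, threads>>>(dev_in, dev_out, img_dimx, img_dimy);

    // Check both the launch itself and the kernel execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(output.data(), dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    return input == output;
}

int main()
{
    const bool ok = run_copy(257, 257);
    printf(ok ? "copy matches\n" : "copy differs or failed\n");
    return ok ? 0 : 1;
}
```

Checking x and y separately also keeps threads with x >= 257 from wrapping into the start of the next row, which a plain Gloc < 66049 test would allow. For a straight copy that wrap is harmless, but it would matter for most other per-pixel operations.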

If you are still having problems, please post how input/output are allocated on the GPU and how the kernel is launched (and try to use code tags please :P).