Is the algorithm wrong?

Below is code that works as expected on the emulator, but on a real GPU it gives garbage output. In fact, all the values of x become zero! Is there a problem with the algorithm? If so, why do all the values of x become zero? I can't figure out the problem. Can anyone please help me?

__global__ void vcal(int rows, int cols, int *x, char a[], char b[])
{
    // int j = 2, i;
    int left, top, leftTop;
    int ix = blockIdx.x * blockDim.x + (cols+1) * threadIdx.x;

    // skip the boundary cells: the first column (ix a multiple of cols),
    // then the rest of the first row
    if (ix == 0 || ix == 4 || ix == 8 || ix == 12 || ix == 16) return;
    if (ix == 1 || ix == 2 || ix == 3) return;

    left = x[ix-1] - 2;
    top  = x[ix-cols] - 2;

    leftTop = x[ix-cols-1];

    // match/mismatch score for this cell
    if (a[ix%4] == b[ix/4])
        leftTop += 1;
    else
        leftTop -= 1;

    // x[ix] = maximum of left, top, leftTop
    if (left > top)
    {
        if (left > leftTop)
            x[ix] = left;
        else
            x[ix] = leftTop;
    }
    else if (top > leftTop)
    {
        if (top > left)
            x[ix] = top;
        else
            x[ix] = left;
    }
    else
    {
        x[ix] = leftTop;
    }
}

Thank you.

Are you sure your kernel is even getting executed on the device? Post the code you are using to call the kernel, including how many blocks and threads per block you are using.

Also, make sure to do error checking around your kernel call…this will let you know if your kernel does not launch.
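Something like this right after the launch (just a rough sketch):

// right after your kernel launch:
err = cudaGetLastError();                      // did the kernel launch at all?
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // wait for the kernel; catches errors raised during execution
if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));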

This is the main program from which I am calling the kernel.

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int *host_x, *host_y;
    int *dev_x;
    int i;
    int rows = 5, cols = 4;
    cudaError_t err;
    char a[] = "-AGC";
    char b[] = "-AAAC";

    host_x = (int *)malloc(rows * cols * sizeof(int));
    host_y = (int *)malloc(cols * rows * sizeof(int));

    for (i = 0; i < rows * cols; i++)
        host_x[i] = i;

    cudaMalloc((void **)&dev_x, rows * cols * sizeof(int));
    cudaMemcpy(dev_x, host_x, rows * cols * sizeof(int), cudaMemcpyHostToDevice);

    vcal<<<25,1>>>(rows, cols, dev_x, a, b);

    err = cudaGetLastError();

    cudaMemcpy(host_y, dev_x, rows * cols * sizeof(int), cudaMemcpyDeviceToHost);

    printf("\nError :%s", cudaGetErrorString(err));

    return 0;
}

I think the kernel is getting executed, since on the emulator I get the expected output. And on the real machine, when I remove the "if(a[ix%4]==b[ix/4])" line from the kernel, it executes properly.

Please don't hit me for this! I am a beginner, so I'm asking these basic questions!
I just found that I am using CUDA 2.1 for the emulator version and 1.1 for the GPU. Can that be the problem here? Can the version difference be the problem?

I can understand your fear of getting hit… being an Indian living in Oz… :-)

The bug is in your passing of "a" and "b" to the kernel. They are CPU arrays.

They can't be passed to a GPU kernel that way. You need to allocate the same arrays on the GPU, copy the host arrays over, and then call the kernel with the GPU pointers.

And,
you are prematurely calling cudaGetLastError… I think so… Better do an "err = cudaThreadSynchronize()" immediately after the kernel launch…
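
Something along these lines (just a sketch; dev_a and dev_b are simply the names I'm using here for the device copies):

char *dev_a, *dev_b;                                             // device copies of a and b

cudaMalloc((void **)&dev_a, 5 * sizeof(char));                   // room for "-AGC"  + '\0'
cudaMalloc((void **)&dev_b, 6 * sizeof(char));                   // room for "-AAAC" + '\0'
cudaMemcpy(dev_a, a, 5 * sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, 6 * sizeof(char), cudaMemcpyHostToDevice);

vcal<<<25,1>>>(rows, cols, dev_x, dev_a, dev_b);                 // device pointers, not host arrays

err = cudaThreadSynchronize();                                   // wait for the kernel, then check err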

Gooood Luck!

char a[] = "-AGC";

char b[] = "-AAAC";

(...)

vcal<<<25,1>>>(rows,cols,dev_x,a,b );

You are passing pointers to host memory as parameters to a GPU function. Remember that when you pass an array as a parameter to a function, it is not copied - instead, just a pointer to the beginning of the array is passed.

Host pointers work in device emulation (deviceemu) mode, but not when the kernel runs on the real device.

Another thing is that launching blocks of size 1 is not very efficient.
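
For example (purely illustrative, and assuming you have made device copies dev_a/dev_b as suggested above; the index calculation inside the kernel would have to match whatever launch configuration you pick):

// one block with a thread per cell, instead of one cell per block
vcal<<<1, 25>>>(rows, cols, dev_x, dev_a, dev_b);
// and inside the kernel: int ix = blockIdx.x * blockDim.x + threadIdx.x;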

Edit: ahh… I was late with my answer :P

Thank you guys! Thanks a ton!

lol! Can't help it… but happy that the situation is far better in Perth than in Sydney or Melbourne!!

Yes, I got the problem! I knew the theory but fell short in the implementation, as always! And I will surely use "err = cudaThreadSynchronize()" from next time onwards!

Thank you pDan. Actually this was my first CUDA program, so there are lots of loopholes in it!