CUDA produces wrong result

Hi everyone, I have a GeForce 9500 GPU (I know it’s old, my new laptop will arrive within a week :D) running openSUSE 11.4 64-bit, and I have written a simple CUDA program to solve the Poisson equation on a 2D grid.

Here is the beginning of the device code:

__global__ void CudaPoissonSolve(float* rho, float* out, int numPoints, float delta)
{
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int idx = x + y * blockDim.x * gridDim.x;

And here’s how I call this code:

extern "C" __host__ void CudaCaller(float* rho, float* out, int numPoints, float delta)
{
	// Producing wrong results
	dim3 grids((numPoints+15)/16, (numPoints+15)/16);
	dim3 threads(16, 16);

	// Producing correct results, but very slow
/*	dim3 grids(numPoints, numPoints);
	dim3 threads(1,1);*/

	float *dev_rho, *dev_out;
	CUCHECK(cudaMalloc((void**)&dev_rho, numPoints * numPoints * sizeof(float)));
	CUCHECK(cudaMalloc((void**)&dev_out, numPoints * numPoints * sizeof(float)));
	CUCHECK(cudaMemcpy(dev_rho, rho, numPoints * numPoints * sizeof(float), cudaMemcpyHostToDevice));
	CudaPoissonSolve<<<grids,threads>>>(dev_rho, dev_out, numPoints, delta);
	CUCHECK(cudaMemcpy(out, dev_out, numPoints * numPoints * sizeof(float), cudaMemcpyDeviceToHost));

	CUCHECK(cudaFree(dev_rho));
	CUCHECK(cudaFree(dev_out));
}

I know it’s not optimal and not very well written; it’s just part of my learning process! Anyway, the problem is this: if I comment out the definitions of grids and threads and replace them with the commented-out ones (one thread per block), the code works correctly but slowly.

When I use the 16x16 version shown above, it produces incorrect results. Using cuda-gdb, I figured out that the problem is in the line

int y = threadIdx.y + blockIdx.y * blockDim.y;

It calculates the wrong y. For example, with threadIdx.y = 3, blockIdx.y = 1, and blockDim.y = 16, the result should be 3 + 16*1 = 19, but it returns only 16 (it calculates the wrong y for other threads too).

By the way, the compute capability is 1.1 and the CUDA version is 3.2.

I was wondering what I’m doing wrong, since I can’t believe the compiler produces incorrect code in such a simple case.

Is numPoints a multiple of 16?

I don’t understand why y = … would be wrong. Are you sure the error comes from the value of y?

No, it’s currently set to 81, but there is code that checks the array bounds later in the function. It seems strange to me too, but cuda-gdb shows that something goes wrong in calculating y.

I even checked it with “print threadIdx.y + blockIdx.y * blockDim.y” and “print y”, and they give me different results.
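For readers following along, here is a rough sketch of what a bounds check like the one mentioned above has to filter out when numPoints is not a multiple of the block size. It is plain host-side C++ that only emulates the launch arithmetic, so it runs without a GPU. The helper names (blocksPerSide, inBounds, countActiveThreads) are made up for illustration and are not from the original program; the actual guard in the kernel is not shown in this thread.

```cpp
// Number of blocks per side using ceiling division, as in
// dim3 grids((numPoints+15)/16, (numPoints+15)/16) for blockSize = 16.
int blocksPerSide(int numPoints, int blockSize) {
    return (numPoints + blockSize - 1) / blockSize;
}

// The kind of guard a kernel typically applies before touching memory
// (assumed form; the original kernel's check is not shown).
bool inBounds(int x, int y, int numPoints) {
    return x < numPoints && y < numPoints;
}

// Emulate every (x, y) a square launch would generate and count
// how many of them pass the guard.
int countActiveThreads(int numPoints, int blockSize) {
    const int side = blocksPerSide(numPoints, blockSize) * blockSize;
    int active = 0;
    for (int y = 0; y < side; ++y) {
        for (int x = 0; x < side; ++x) {
            if (inBounds(x, y, numPoints)) {
                ++active;
            }
        }
    }
    return active;
}
```

With numPoints = 81 and 16x16 blocks, blocksPerSide gives 6, so the launch covers 96 x 96 = 9216 threads, of which only 81 x 81 = 6561 pass the guard. Note that a guard of this form only validates x and y; it says nothing about how they are combined into a flat index.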

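int idx = x + y * blockDim.x * gridDim.x;

should be

int idx = x + y * numPoints;

The array has numPoints elements per row, whereas blockDim.x * gridDim.x is the number of threads launched along x (96 for numPoints = 81 with 16x16 blocks).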
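To see why the two formulas differ only for the 16x16 launch, here is a small host-side sketch in plain C++ that emulates the index arithmetic (no GPU needed). Dim2 is a hypothetical stand-in for CUDA's dim3, and the function names are invented for this illustration.

```cpp
#include <cstdio>

// Host-side stand-in for CUDA's dim3 (hypothetical, for illustration only).
struct Dim2 {
    int x;
    int y;
};

// Flat index as computed in the original kernel: the row pitch is the
// number of threads launched along x (blockDim.x * gridDim.x).
int idxLaunchPitch(int x, int y, Dim2 blockDim, Dim2 gridDim) {
    return x + y * blockDim.x * gridDim.x;
}

// Flat index that matches a row-major numPoints x numPoints array.
int idxArrayPitch(int x, int y, int numPoints) {
    return x + y * numPoints;
}

// Print both indices for grid point (x, y) under a given launch configuration.
void report(int x, int y, int numPoints, Dim2 blockDim, Dim2 gridDim) {
    std::printf("(%d,%d): launch pitch -> %d, array pitch -> %d\n",
                x, y,
                idxLaunchPitch(x, y, blockDim, gridDim),
                idxArrayPitch(x, y, numPoints));
}
```

For numPoints = 81 with 16x16 threads and a 6x6 grid, point (3, 1) maps to index 99 with the launch pitch but to 84 in the actual array. With 1x1 threads and an 81x81 grid, both formulas give 84, which is consistent with the slow configuration producing correct results. With the launch pitch, the last valid point (80, 80) would map to index 7760, past the end of a 6561-element array.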

Thanks, you’re right. But as I said, the y value is calculated incorrectly, so the result would be wrong anyway. I’m going to change the code a little bit to see whether the bug goes away.

Wow, it worked! Thanks, mfatica, you’re my savior! I still can’t understand how gdb reported wrong results for y, though.