Weird error in CUDA code. Please help

Hi.

I have a C version and a CUDA version of the same routine, which in my head should do exactly the same thing. With identical input data the C code gives me the right result, but the CUDA code returns a zero matrix. Can anyone see what’s wrong with my CUDA code?

CUDA:

__global__ void reformat(cufftComplex *matrix, cufftComplex *ans, float *range,
                         float *th_vec, float r_max, int NY, int NX)
{
	int mx = blockIdx.x*blockDim.x + threadIdx.x;
	int my = blockIdx.y*blockDim.y + threadIdx.y;
	if(mx < NX && my < NY) {
		float x = mx*2*r_max / NX - r_max;
		float y = my*2*r_max / NY - r_max;
		if(x == 0)
			x = epsilon;
		float r = sqrt(x*x + y*y);
		if(r < r_max && r > -r_max) {
			float theta = atan(y/x);
			if(x < 0)
				theta += pi;
			if(theta < 0)
				theta += 2*pi;
			int xval = -1;
			int yval = -1;
			float error = 99999;
			for(int i = 0; i < NX; ++i) {
				float xerr = th_vec[i] - theta;
				for(int j = 0; j < NY; ++j) {
					float yerr = range[j] - r;
					float err = xerr*xerr + yerr*yerr;
					if(err < error) {
						error = err;
						xval = i;
						yval = j;
					}
				}
			}
			ans[my + mx*NY].x = matrix[yval + xval*NY].x;
			ans[my + mx*NY].y = matrix[yval + xval*NY].y;
		} else {
			ans[my + mx*NY].x = ans[my + mx*NY].y = 0.0f;
		}
	}
}

C:

for(int my = 0; my < NY; ++my) {
	for(int mx = 0; mx < NX; ++mx) {
		double x = mx*2*r_max / NX - r_max;
		double y = my*2*r_max / NY - r_max;
		if(x == 0)
			x = epsilon;
		double r = sqrt(x*x + y*y);
		if(r < r_max && r > -r_max) {
			double theta = atan(y/x);
			if(x < 0)
				theta += pi;
			if(theta < 0)
				theta += 2*pi;
			int yval = -1;
			int xval = -1;
			double error = 99999;
			for(int i = 0; i < NY; ++i) {
				for(int j = 0; j < NX; ++j) {
					double xerr = th_vec[j] - theta;
					double yerr = range[i] - r;
					double err = xerr*xerr + yerr*yerr;
					if(err < error) {
						error = err;
						xval = j;
						yval = i;
					}
				}
			}
			rans[my + mx*NY] = real[yval + xval*NY];
			ians[my + mx*NY] = imag[yval + xval*NY];
		}
	}
}

I really can’t find the error, but regardless of the input, the CUDA code returns a zero matrix.

EDIT: If I run it in device emulation mode, it looks like the program doesn’t even enter the kernel; if I try to print something from inside it, nothing comes out. I’m calling the program from MATLAB and therefore have to use mexPrintf, but that shouldn’t stop the kernel from printing, should it? It also takes less time in emulation mode than in normal mode, so something is wrong there.
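
In case it helps, this is roughly how I launch the kernel and how I would check for errors afterwards. The block size (16x16), the d_* pointer names and the run_reformat wrapper are placeholders for the sketch, not my exact MEX code (which also does the usual cudaMalloc/cudaMemcpy of the MATLAB arrays):

// assumes mex.h, cufft.h and cuda_runtime.h are included
void run_reformat(cufftComplex *d_matrix, cufftComplex *d_ans,
                  float *d_range, float *d_th_vec,
                  float r_max, int NY, int NX)
{
	dim3 block(16, 16);
	dim3 grid((NX + block.x - 1) / block.x,   // round up so sizes like 417 are covered
	          (NY + block.y - 1) / block.y);

	reformat<<<grid, block>>>(d_matrix, d_ans, d_range, d_th_vec, r_max, NY, NX);

	cudaError_t err = cudaGetLastError();     // reports launch/configuration errors
	if(err != cudaSuccess)
		mexPrintf("launch failed: %s\n", cudaGetErrorString(err));

	err = cudaThreadSynchronize();            // reports errors during kernel execution
	if(err != cudaSuccess)
		mexPrintf("kernel failed: %s\n", cudaGetErrorString(err));
}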

EDIT2: The program itself works, but only up to a 417x417 matrix. If I then add a column, the output is wrong. Could this be a lack of memory on the GPU? It sounds weird, though: I have a GeForce 8800 Ultra with 804978688 bytes of global memory and 16384 bytes of shared memory, and I can’t see how my code could exceed that. Maybe someone can help me with the reason?
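
For what it’s worth, this is the back-of-envelope calculation I did for the memory question (418 is just the first size that fails for me, and I only count the arrays the kernel actually touches):

#include <cufft.h>
#include <stdio.h>

int main(void)
{
	// rough footprint of the device arrays at the first failing size (418x418)
	size_t n     = 418 * 418;
	size_t inout = 2 * n * sizeof(cufftComplex);   // matrix + ans: ~2.8 MB
	size_t vecs  = 2 * 418 * sizeof(float);        // range + th_vec: ~3.3 KB
	printf("total = %lu bytes\n", (unsigned long)(inout + vecs));
	return 0;
}

So the data itself is only a few MB, far below the ~768 MB of global memory, which is why the memory explanation sounds so strange to me.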

Can no one help with this one?