Computing with CUDA: some problems with kernel configurations

Hello everybody…

I am still a newbie at CUDA programming, which I am using for my master's thesis. I have gained some experience with it by now, and sometimes I love it and sometimes I hate it. Recently I discovered some strange behavior: I ran my computation with different kernel configurations, and there are results I cannot explain. For example, with a configuration of two blocks of 512 threads each, the computation fails. With two blocks of 300 threads it works fine, and 1024 blocks with one thread each works fine, too. So maybe I misunderstood something…

Here is the code:

__device__
float func(const float x,
           const float x0, const float x1,
           const float b0, const float b1, const float a)
{
    float h = x1 - x0;
    float tmp = 1.f / (2.f*h);
    return tmp*b1*pow(x-x0, 2.f) - tmp*b0*pow(x1-x, 2.f) + a;
}

__global__
void kernelSpline(float2 *ret, const int resN, float *cfA, float *cfB,
                  const float2 *iPts, const int baseN)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int threads = blockDim.x * gridDim.x;   // total number of threads in the grid
    float2 pmin, pmax;
    float2 p0, p1, pt;

    pmin = iPts[0];
    pmax = iPts[baseN-1];
    float t = (pmax.x - pmin.x) / resN;
    // grid-stride loop: each thread computes every `threads`-th sample
    for(int i = id; i < resN; i += threads) {
        pt.x = pmin.x + i*t;
        int index = findInterval(p0, p1, pt.x, iPts, baseN);
        pt.y = func(pt.x, p0.x, p1.x, cfB[index], cfB[index+1], cfA[index+1]);
        ret[i] = pt;
    }
}

I call the kernel from host code with the given configuration, and the kernel calls the device function. My idea is that every thread computes one element. Is there perhaps a mistake in the compilation? Please help me understand the error…
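For completeness, here is roughly how I launch the kernel from the host (a simplified sketch; the real allocation and copy code is longer and the buffer names differ). I added cudaGetLastError and cudaDeviceSynchronize checks to see whether the launch itself fails, e.g. because 512 threads per block exceed a per-block resource limit (registers or shared memory) of my device:

```cpp
// Simplified host-side launch sketch; buffer setup is abbreviated.
float2 *dRet, *dPts;
float *dA, *dB;
cudaMalloc(&dRet, resN * sizeof(float2));
cudaMalloc(&dPts, baseN * sizeof(float2));
cudaMalloc(&dA, baseN * sizeof(float));
cudaMalloc(&dB, baseN * sizeof(float));
// ... copy the input points and coefficients with cudaMemcpy ...

dim3 grid(2), block(512);   // the failing configuration
kernelSpline<<<grid, block>>>(dRet, resN, dA, dB, dPts, baseN);

cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();      // catches errors during execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```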

Thanks in advance.