Multiple kernel calls from one host function: strange behaviour when calling kernel

Hi everyone,

Recently I tried to optimize my working code. Instead of calling

void func(pType * p){
...
    pType *p1;
    cudaMalloc((void**)&p1, MN*sizeof(pType));
    for (int i = 0; i < JACOBI_ITERATIONS; i++)
    {
        jacobi_shared3<<<grid, block>>>(p1, p, rhs, dx, dy, alpha, beta, pitch);   // stores results to first argument
        cudaThreadSynchronize();
        cudaMemcpy(p, p1, sizeof(pType)*dx*dy, cudaMemcpyDeviceToDevice);
    }
    cudaFree(p1);
}

I tried something like this:

void func(pType * p){
...
    pType *p1;
    cudaMalloc((void**)&p1, MN*sizeof(pType));
    for (int i = 0; i < JACOBI_ITERATIONS/2; i++)
    {
        jacobi_shared3<<<grid, block>>>(p1, p, rhs, dx, dy, alpha, beta, pitch);      // stores results to first argument
        cudaThreadSynchronize();

        jacobi_shared3<<<grid, block>>>(p, p1, rhs, dx, dy, alpha, beta, pitch);
        cudaThreadSynchronize();
    }
    cudaFree(p1);
}

The problem is that I get cudaError=30 (checked with cudaGetLastError()) when calling the second cudaThreadSynchronize().

Why is that so? What am I doing wrong?

And will this change affect the performance of the calculation?

I have a GTX275 with CUDA 3.2 installed.

Any help will be appreciated.
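
For reference, the idea behind the second version (alternating the roles of the two buffers instead of copying back after every sweep) can be sketched on the CPU. This is only an illustration under stated assumptions: `sweep` is a hypothetical stand-in that does a trivial 1-D three-point average, not jacobi_shared3, and `iterate_with_copy` / `iterate_with_swap` are made-up names. Both variants should give identical results for the same number of sweeps. Two things worth noting about the posted second version: JACOBI_ITERATIONS/2 is integer division, so an odd iteration count runs one sweep fewer than the first version, and after an odd number of sweeps the newest data would sit in p1 rather than p.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for one kernel launch: writes one sweep into dst while reading src.
// A trivial 1-D three-point average is used here; it is NOT jacobi_shared3.
void sweep(std::vector<double>& dst, const std::vector<double>& src)
{
    assert(dst.size() == src.size());
    const std::size_t n = src.size();
    if (n == 0) return;
    dst[0] = src[0];          // boundary values are passed through unchanged
    dst[n - 1] = src[n - 1];
    for (std::size_t i = 1; i + 1 < n; ++i)
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0;
}

// Pattern of the first version: sweep into the scratch buffer, then copy back.
std::vector<double> iterate_with_copy(std::vector<double> p, int iterations)
{
    std::vector<double> p1(p.size());
    for (int i = 0; i < iterations; ++i) {
        sweep(p1, p);
        p = p1;  // plays the role of the device-to-device cudaMemcpy
    }
    return p;
}

// Ping-pong pattern: swap the roles of the two buffers instead of copying.
std::vector<double> iterate_with_swap(std::vector<double> p, int iterations)
{
    std::vector<double> p1(p.size());
    std::vector<double>* src = &p;
    std::vector<double>* dst = &p1;
    for (int i = 0; i < iterations; ++i) {
        sweep(*dst, *src);
        std::swap(src, dst);  // the newest data is now behind src
    }
    return *src;  // for an odd count this is the scratch buffer, not p
}
```

Swapping pointers this way handles odd and even iteration counts alike. On the device, dropping the per-iteration copy should reduce memory traffic, though whether that is noticeable depends on how expensive the kernel itself is.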

Where do you allocate memory for *p in the second version?

Oh, sorry, it's allocated somewhere else by cudaMalloc and passed as a parameter, just like in the first version. I just posted it incorrectly, but it's fixed now.

Then I don’t know, as long as you allocate enough memory for *p.

It looks a bit like you have an out-of-bounds access somewhere in the kernel.
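
One way to narrow that down from the host side is to compare the largest index the kernel could touch with the number of elements allocated. The sketch below assumes the kernel addresses element (x, y) as y * pitch + x with pitch counted in elements; the kernel source is not shown in this thread, so that indexing scheme is a guess, and `max_linear_index` and `fits_in_buffer` are hypothetical helper names. Note that the buffers are allocated with MN elements while the cudaMemcpy in the first version copies dx*dy elements, so it may be worth checking that these sizes and pitch agree.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: largest element index touched by a kernel that
// addresses element (x, y) as y * pitch + x, with pitch counted in elements.
// This indexing scheme is an assumption; the real kernel may differ.
std::size_t max_linear_index(std::size_t dx, std::size_t dy, std::size_t pitch)
{
    assert(dx > 0 && dy > 0 && pitch >= dx);
    return (dy - 1) * pitch + (dx - 1);
}

// True when every such index lies inside a buffer of num_elements elements.
bool fits_in_buffer(std::size_t dx, std::size_t dy, std::size_t pitch,
                    std::size_t num_elements)
{
    return max_linear_index(dx, dy, pitch) < num_elements;
}
```

Calling something like fits_in_buffer(dx, dy, pitch, MN) before the launch would flag a mismatch. If pitch turns out to be in bytes (as returned by cudaMallocPitch), it would have to be divided by sizeof(pType) first.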

It's declared at global scope as:

pType *pField = NULL; //pressure

and allocated:

cudaMalloc((void**) &pField, sizeof(pType) * MN);