kernel printf strange behaviour of printf in __global__ sub

Hello, I am seeing strange behaviour from the cudaThreadSynchronize() function.
Here is my source code and what happens:

for (t=0;t<10;t++) {
	runneus<<<dim3(GRID / TBX, SLICES / TBY),dim3(TBX,TBY)>>>(neus,rnds,100,70,10.0);
	printf("%s\n",cudaGetErrorString(cudaGetLastError()));
	runsyns<<<dim3(GRID / CHX, SLICES),dim3(CHX ,NPRE)>>>(neus);
	printf("%s\n",cudaGetErrorString(cudaGetLastError()));
	reduces<<<dim3(GRID / CHX, SLICES),dim3(CHX ,NPRE)>>>(neus);
	printf("%s\n",cudaGetErrorString(cudaGetLastError()));
	getch();

	//cudaThreadSynchronize();
	printf("\n");
}

With the call commented out like this, cudaGetLastError() reports no error.

When I enable the call (remove the // comment), this is the output:

no error
no error
no error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

unknown error
unknown error
unknown error

Has anyone solved this error? What's happening?
Thank you…

Neither kernel launches nor cudaGetLastError are blocking calls, and cudaGetLastError only returns the error state of the CUDA runtime at the time it is called. What is probably happening is that the first kernel launches OK, and the other two are then queued successfully. On most sane platforms a kernel launch only takes about 10 microseconds, and I guess your execution time is considerably longer than that. The first kernel later aborts during execution, either leaving an error with the runtime or killing the context altogether. The cudaThreadSynchronize call (which is blocking) forces the host to wait until the kernels are done, then the next loop trip calls cudaGetLastError again and you get to see the error.
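To make the timing concrete, here is a minimal sketch of the same effect. The kernel `faulty` is hypothetical and fails on purpose; it is not your code. It uses cudaDeviceSynchronize, the non-deprecated replacement for cudaThreadSynchronize in newer toolkits. The exact error strings depend on the toolkit and driver version, so none are shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that faults on purpose by writing through a null pointer.
__global__ void faulty(int *p)
{
    p[threadIdx.x] = 1;
}

int main()
{
    faulty<<<1, 32>>>(nullptr);

    // The launch only queues the work, so the configuration check
    // normally passes and this reports "no error".
    printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // Blocking until the kernel has finished surfaces the execution failure.
    printf("after sync:   %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    // A fault like this usually leaves the context unusable, so later
    // runtime calls are expected to keep returning an error.
    printf("next call:    %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

That matches the pattern in your output: the first loop trip looks clean, and every check after the first synchronize reports the failure.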

For debugging a kernel launch you should do something like this:

runneus<<<dim3(GRID / TBX, SLICES / TBY),dim3(TBX,TBY)>>>(neus,rnds,100,70,10.0);
printf("%s %d %s\n", __FILE__, __LINE__, cudaGetErrorString(cudaGetLastError()));
printf("%s %d %s\n", __FILE__, __LINE__, cudaGetErrorString(cudaThreadSynchronize()));


for every kernel call you launch. That will serialize each launch, tell you which launch is failing, and may give more information about the error. Once everything works, strip out the blocking calls.
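If you do not want to repeat the two printf lines after every launch, one option is to wrap them in a small helper. This is only a sketch: `checkLaunch`, `CHECK_LAUNCH`, the `DEBUG_SYNC` switch and the placeholder kernel `fill` are names made up for illustration. It calls cudaDeviceSynchronize; on an old toolkit, cudaThreadSynchronize would take its place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DEBUG_SYNC 1  // hypothetical switch: set to 0 once everything works

// Hypothetical helper: report the launch error and, in debug builds,
// the execution error of the kernel launched just before the call.
static bool checkLaunch(const char *name, const char *file, int line)
{
    cudaError_t err = cudaGetLastError();   // invalid configuration etc.
#if DEBUG_SYNC
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // blocks; failure during execution
#endif
    if (err != cudaSuccess) {
        fprintf(stderr, "%s:%d kernel %s: %s\n",
                file, line, name, cudaGetErrorString(err));
        return false;
    }
    return true;
}
#define CHECK_LAUNCH(name) checkLaunch(name, __FILE__, __LINE__)

// Placeholder kernel standing in for runneus, runsyns and reduces.
__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const int n = 1024;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fill<<<n / 256, 256>>>(d_out, n);
    bool ok = CHECK_LAUNCH("fill");  // one check per launch

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Setting DEBUG_SYNC to 0 removes the blocking call and keeps the cheap launch check, which is the "strip out the blocking calls" step in a single place.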