Asynchronous calls to GPU in a 'for' loop - unspecified launch failure

Hello,
I am trying to do something very similar to the CUDA sample code AsyncAPI: make an asynchronous call to a GPU kernel from the CPU while also running something on the CPU in parallel. The AsyncAPI example shows the use of a stream and an event for this purpose. When I try to do the same thing, but in a ‘for’ loop, I get “unspecified launch failure”. I am posting a snippet of the code below.

The code runs properly for iteration 0 but fails in iteration 1, giving “unspecified launch failure” at “cutilSafeCall( cudaEventElapsedTime(&gpu_time, start, stop) );”. Even this pattern is random: sometimes it fails in iteration 1 and sometimes in iteration 2, but it always runs fine in iteration 0.

Has anybody tried doing asynchronous kernel calls within a loop before? Can anybody help me, please?

// snippet of the code - the code compiles properly

cudaEvent_t start, stop;
unsigned long int tester = 0;

for (i = 0; i < 3; i++)
{
    cutilSafeCall( cudaEventCreate(&start) );
    cutilSafeCall( cudaEventCreate(&stop) );

    // asynchronously issue work to the GPU (all to stream 0)
    cutilCheckError( cutResetTimer(timer) );
    cutilSafeCall( cudaThreadSynchronize() );

    cutilCheckError( cutStartTimer(timer) );
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    cuda_kernel<<<nblocks, nthreads, 0, 0>>>(d_a, db, dc);
    cudaEventRecord(stop, 0);
    cutilCheckError( cutStopTimer(timer) );

    // have CPU do something while waiting for stage 1 to finish
    while (cudaEventQuery(stop) == cudaErrorNotReady) {
        tester++;
    }

    cutilSafeCall( cudaEventElapsedTime(&gpu_time, start, stop) );
    cutilSafeCall( cudaEventDestroy(start) );
    cutilSafeCall( cudaEventDestroy(stop) );
} // for loop ends

thanks.

Because the kernel launch is asynchronous, if your kernel generates an error (a memory access violation is a common source of “unspecified launch failure”), the error will surface in the return value of some later CUDA call. The varying behavior you are seeing is probably just the non-deterministic delay between the kernel launch and the error occurring on the device. I would take another look at your kernel and see if you can spot some kind of memory problem.
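
If you want to pin down which launch is actually failing, one option is to temporarily defeat the asynchrony: synchronize right after the launch and query the error state there. A sketch, reusing your variable names (cudaThreadSynchronize() is the older spelling of cudaDeviceSynchronize()):

// debugging only: place inside the loop, right after the launch
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
cuda_kernel<<<nblocks, nthreads, 0, 0>>>(d_a, db, dc);

cudaError_t err = cudaThreadSynchronize();   // blocks until this iteration's kernel is done
if (err == cudaSuccess)
    err = cudaGetLastError();                // picks up any recorded launch error
if (err != cudaSuccess)
    printf("iteration %d: %s\n", i, cudaGetErrorString(err));

That way the “unspecified launch failure” is reported in the same iteration that caused it, instead of at some later API call.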

Thanks seibert for the quick reply. I do not think there is any issue with the kernel, because it runs fine when I call it synchronously; I have tested it thoroughly. Also, if there were some issue with the CUDA kernel, it should randomly fail in iteration 0 as well as in iterations 1 and 2. Why does it never give any trouble in iteration 0 but fail later on?
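
(By “synchronous call” I mean something like this sketch, with a blocking copy and an explicit synchronize after the launch, using the same variable names as above:)

cutilSafeCall( cudaMemcpy(d_a, a, nbytes, cudaMemcpyHostToDevice) ); // blocking copy
cuda_kernel<<<nblocks, nthreads>>>(d_a, db, dc);
cutilSafeCall( cudaThreadSynchronize() ); // any kernel error would show up here

This version never produces the error, no matter how many times I run it.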

Hmm, I see your point. Not sure I have any more ideas, then.
