Looping kernel calls: unspecified launch failure on cudaFree()?


I am calling a kernel in the following function, RunBatchGPU(), which is called in a loop in my main function.

cutilSafeCall returns the error "unspecified launch failure" at cudaFree(d_breads) in the code below; it doesn't even finish the first loop iteration, and I can't figure out why. I got the same error when I declared d_breads globally as well.

Edit: when I took out that line, the call completed the first time, but on the second loop iteration it returned the same error, this time at cudaMemcpy2D.

extern "C" void
RunBatchGPU(int num, char **breads)
{
	//std::cout << "My breads: " << breads[0][5] << endl;
	char *d_breads;
	size_t p_breads;

	checkCUDAError("before breads allocation");
	cutilSafeCall(cudaMallocPitch((void**) &d_breads, &p_breads, 62*sizeof(char), num));
	cutilSafeCall(cudaMemcpy2D(d_breads, p_breads, breads, 62*sizeof(char), 62*sizeof(char), num, cudaMemcpyHostToDevice));

	dim3 grid(num, 1, 1);
	dim3 threads(1, 1, 1);

	// execute the kernel
	checkCUDAError("before 1kernel invocation");
	kernel_RunAlign<<<grid, threads, 1>>>(d_breads, p_breads);
	checkCUDAError("after 1kernel invocation");

	cout << "Ready to do StringAlign" << endl;
	// ... (rest of the function, including the cudaFree(d_breads) that fails, omitted)
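For anyone debugging the same symptom: because kernel launches are asynchronous, the error often surfaces at the next API call rather than where it actually happened, so forcing a synchronize right after the launch pins it down. A minimal sketch, assuming a hypothetical CHECK macro (the original checkCUDAError presumably does something similar) and that kernel_RunAlign is defined elsewhere; note that cudaMemcpy2D copies from one contiguous host buffer, so the sketch takes a flattened char* (num rows of 62 bytes) rather than the char** from the question:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check helper: report and abort on the first CUDA error.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err__ = (call);                                 \
        if (err__ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err__));                     \
            exit(1);                                                \
        }                                                           \
    } while (0)

__global__ void kernel_RunAlign(char *breads, size_t pitch); // defined elsewhere

// Mirrors the snippet above, but checks the error status at every step
// so the failing call is the one actually at fault.
void RunBatchGPUChecked(int num, const char *flatBreads)
{
    char  *d_breads = 0;
    size_t p_breads = 0;

    CHECK(cudaMallocPitch((void **)&d_breads, &p_breads, 62 * sizeof(char), num));
    CHECK(cudaMemcpy2D(d_breads, p_breads, flatBreads, 62 * sizeof(char),
                       62 * sizeof(char), num, cudaMemcpyHostToDevice));

    kernel_RunAlign<<<num, 1>>>(d_breads, p_breads);
    CHECK(cudaGetLastError());       // catches bad launch configurations
    CHECK(cudaThreadSynchronize());  // surfaces faults from inside the kernel

    CHECK(cudaFree(d_breads));       // should now be clean if the steps above were
}
```

If the synchronize after the launch is what reports the failure, the problem is inside the kernel itself, not in cudaFree.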


unspecified launch failure = segfault in your kernel. valgrind plus deviceemu to the rescue!


You’ve probably had to write this one too many times on this forum. I’ve finally figured out how to get emu mode on and valgrind to give me some meaningful hints.

For the sake of clarification to other beginners:

I just added -deviceemu to the nvcc call, downloaded valgrind, and ran the program with: valgrind --leak-check=full ./programName

You are freeing memory before the kernel execution completes. That's why.

Kernel launches are asynchronous. You need to use cudaThreadSynchronize() to make sure that kernel launches have completed BEFORE freeing anything.
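Applied to the original code, the suggestion above would look something like this (a sketch; the function and variable names follow the question, and kernel_RunAlign is assumed to be defined elsewhere):

```cuda
#include <cuda_runtime.h>

__global__ void kernel_RunAlign(char *breads, size_t pitch); // from the question

void launchThenFree(char *d_breads, size_t p_breads, int num)
{
    dim3 grid(num, 1, 1);
    dim3 threads(1, 1, 1);

    kernel_RunAlign<<<grid, threads>>>(d_breads, p_breads);
    cudaThreadSynchronize(); // proposed fix: wait until the kernel is done...
    cudaFree(d_breads);      // ...before releasing the buffer it reads from
}
```

(Whether that synchronize is actually required before cudaFree is exactly what the replies below argue about.)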


How many times have we told you that THIS IS NOT TRUE?

I wrote a test app to prove it. First time through, record an event, launch a kernel, record an event. Second time through, record event, launch kernel, free, record an event. Compare elapsed time. If cudaFree doesn’t block, then for a reasonably sized kernel (this one takes 25ms on my MBP’s 8600GT) it will just plow right through, invalidate a bunch of stuff, and cause general badness.
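The experiment described above can be sketched roughly like this. Everything here is an assumption standing in for the original test app: busyKernel is just a placeholder that runs long enough to measure, and the sizes are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for any kernel that runs long enough to time (the original
// post used a kernel taking ~25 ms on an 8600GT).
__global__ void busyKernel(float *p, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        for (int k = 0; k < 10000; ++k)
            p[i] = p[i] * 0.999f + 1.0f;
}

// Time a launch, optionally followed by cudaFree of a scratch buffer.
// If cudaFree did not block on outstanding work, the two timings would
// be essentially identical; in practice the second includes the free.
static float timeLaunch(bool freeAfter)
{
    const int n = 1 << 16;
    float *d = 0, *scratch = 0;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMalloc((void **)&scratch, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busyKernel<<<1, 256>>>(d, n);
    if (freeAfter)
        cudaFree(scratch);   // does this wait for the kernel, or plow through?
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    if (!freeAfter)
        cudaFree(scratch);
    cudaFree(d);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    printf("kernel only:       %.2f ms\n", timeLaunch(false));
    printf("kernel + cudaFree: %.2f ms\n", timeLaunch(true));
    return 0;
}
```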

Guess what: that doesn't happen, because it would be a stupid, intentionally confusing API design unless you had some sort of really crazy event system!

(for the record, 25ms for kernel and 31ms for kernel plus free)

Josh Anderson flips out every time you say this because it's blatantly wrong, so I guess it's my turn. CUDA inserts synchronization wherever it needs it. Two kernels in a row? Implicit sync barrier. cudaMemcpy immediately after a kernel? Implicit sync, and no overlap, because it's in stream 0. cudaMemcpyAsync? That one you handle yourself, because maybe you want to start copying immediately; it's async, and sometimes there is overlap, so you can do that!

The only time you need to insert some sort of manual synchronization primitive is basically when you are using zero-copy or async; when things can't happen simultaneously anyway, you don't need to insert a friggin synchronization barrier. You don't need a cudaThreadSynchronize between each kernel launch (because that would be stupid), you don't need one between a kernel and a memcpy (because that would be stupid), and you don't need one between any memory management function and a kernel. (I don't even think you have to with zero-copy, but I haven't tried it to be sure.)
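The implicit ordering described above comes from the fact that operations issued into the same stream (the default stream 0, unless you ask for another) execute in issue order. A small sketch of the cases mentioned, with hypothetical placeholder kernels kernelA and kernelB:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(int *p) { p[threadIdx.x] += 1; }
__global__ void kernelB(int *p) { p[threadIdx.x] *= 2; }

int main()
{
    int h[32] = {0};
    int *d = 0;
    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    // Two kernels in a row: both go into stream 0, so kernelB cannot
    // start until kernelA has finished. No manual sync needed.
    kernelA<<<1, 32>>>(d);
    kernelB<<<1, 32>>>(d);

    // A blocking cudaMemcpy after a kernel: also queued behind the
    // kernels in stream 0, and it does not return until the copy is
    // complete. Again, no manual sync needed.
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    // every h[i] is now (0 + 1) * 2 == 2

    // cudaMemcpyAsync into a non-default stream is the case where this
    // ordering guarantee goes away and you synchronize yourself.
    cudaFree(d);
    return 0;
}
```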

Done ranting.

Hi everybody (my first post),

I'm indeed trying to loop kernel calls:

for (int i = 0; i < someSize; i++)
{
	someKernel_1<<< x, y >>>(i, /* some args */);
}

If I don't use cudaThreadSynchronize() after each kernel call, I get the frightening Windows bluescreen and my computer starts rebooting. But if I put a second, different kernel into my loop, like ...

for (int i = 0; i < someSize; i++)
{
	someKernel_1<<< x, y >>>(i, /* some args */);
	someKernel_2<<< x, y >>>(i, /* some args */);
}
… then I don't need to call cudaThreadSynchronize(). I don't really understand why … where is the implicit sync barrier?

Thanks in advance, Robert