Looping kernel calls Unspecified launch error on cudaFree() ??

hdinh · April 1, 2009, 12:51am

Hi!

I am calling a kernel in the following function, RunBatchGPU(), which is called in a loop from my main function.

cutilSafeCall reports "unspecified launch failure" at cudaFree(d_breads) in the following code. It doesn’t finish the first iteration, and I can’t figure out why. I got the same error when I declared d_breads globally as well.

Edit: when I took out the cudaFree line, the call completed the first time, but on the second iteration it returned the same error, this time at cudaMemcpy2D.

extern "C" void RunBatchGPU(int num, char ** breads)
{
	//std::cout<<"My breads: "<<breads[0][5]<<endl;
	char* d_breads;
	size_t p_breads;

	checkCUDAError("before breads allocation");
	cutilSafeCall(cudaMallocPitch((void**) &d_breads, &p_breads, 62*sizeof(char), num));
	cutilSafeCall(cudaMemcpy2D(d_breads, p_breads, breads, 62*sizeof(char), 62*sizeof(char), num, cudaMemcpyHostToDevice));

	dim3 grid(num, 1, 1);
	dim3 threads(1, 1, 1);

	// execute the kernel
	checkCUDAError("before 1kernel invocation");
	kernel_RunAlign<<<grid, threads, 1>>>( d_breads , p_breads);

	cutilSafeCall(cudaFree(d_breads));
	checkCUDAError("after 1kernel invocation");

	cout<<"Ready to do StringAlign"<<endl;
}
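
A side note on the host-side copy in the code above, separate from the kernel fault: `breads` is a `char**`, that is, an array of row pointers, while `cudaMemcpy2D` expects its source to be one contiguous block whose rows are `spitch` bytes apart. Passing `breads` directly copies the pointer table rather than the characters. The following is a minimal sketch of one way around this, assuming each row holds at least 62 bytes. `uploadRows` and `ROW_LEN` are hypothetical names that do not appear in the original post, and this may or may not be related to the launch failure being asked about.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

const int ROW_LEN = 62;  // row width used in the post

// Hypothetical helper (not from the post): copies `num` host rows, given as an
// array of pointers, into freshly allocated pitched device memory.
cudaError_t uploadRows(char** rows, int num, char** d_out, size_t* pitch_out)
{
    // Flatten the pointer array into one contiguous host buffer.
    std::vector<char> packed(static_cast<size_t>(num) * ROW_LEN);
    for (int i = 0; i < num; ++i)
        std::memcpy(&packed[static_cast<size_t>(i) * ROW_LEN], rows[i], ROW_LEN);

    cudaError_t err = cudaMallocPitch((void**)d_out, pitch_out, ROW_LEN, num);
    if (err != cudaSuccess) return err;

    // Source pitch is ROW_LEN because the packed buffer has no padding.
    err = cudaMemcpy2D(*d_out, *pitch_out, packed.data(), ROW_LEN,
                       ROW_LEN, num, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        cudaFree(*d_out);
        *d_out = nullptr;
    }
    return err;
}
```

The caller owns the returned device pointer and is expected to release it with cudaFree once the kernel that uses it is done.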

tmurray · April 1, 2009, 12:54am

unspecified launch failure = segfault in your kernel. valgrind plus deviceemu to the rescue!
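
Because a kernel launch returns before the kernel has run, a fault inside the kernel is reported by whichever later runtime call waits for it, which is why the message above names cudaFree. A minimal sketch of pinning the error to the launch while debugging follows. `touchRows` and `launchChecked` are hypothetical stand-ins, not code from this thread, and `cudaDeviceSynchronize()` is the later replacement for the `cudaThreadSynchronize()` call mentioned elsewhere in this thread. The explicit synchronize is there only to localize the error, not because the program needs it for correctness.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel standing in for kernel_RunAlign: one block per row.
__global__ void touchRows(char* data, size_t pitch, int width)
{
    char* row = data + blockIdx.x * pitch;
    for (int i = 0; i < width; ++i)
        row[i] += 1;
}

// Launches the kernel and reports which stage, if any, produced an error.
cudaError_t launchChecked(char* d_data, size_t pitch, int width, int rows)
{
    touchRows<<<rows, 1>>>(d_data, pitch, width);

    // Errors in the launch itself (for example, a bad configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel runs (for example, out-of-bounds access).
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));
    return err;
}
```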

hdinh · April 19, 2009, 8:53pm

Thanks!

You’ve probably had to write this one too many times on this forum. I’ve finally figured out how to get emu mode on and valgrind to give me some meaningful hints.

For the sake of clarification to other beginners:

I just added -deviceemu to the nvcc call, downloaded valgrind, and ran the program with: valgrind --leak-check=full ./programName

Sarnath · April 20, 2009, 5:51am

You are freeing memory before the kernel execution completes. That’s why.

Kernel launches are asynchronous. You need to use “cudaThreadSynchronize()” to make sure that kernel launches have completed BEFORE freeing anything.

tmurray · April 20, 2009, 7:17am

DFGLKJNDFLKBJNDFLKVJNLDSKFJVNLSKJNFAKLJNADF

how many times have we told you that THIS IS NOT TRUE.

I wrote a test app to prove it. First time through, record an event, launch a kernel, record an event. Second time through, record event, launch kernel, free, record an event. Compare elapsed time. If cudaFree doesn’t block, then for a reasonably sized kernel (this one takes 25ms on my MBP’s 8600GT) it will just plow right through, invalidate a bunch of stuff, and cause general badness.

guess what: that doesn’t happen because it would be stupid! an intentionally confusing API design unless you have some sort of really crazy event system!

(for the record, 25ms for kernel and 31ms for kernel plus free)
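
The test described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original test app: the `spin` kernel, the `timeLaunch` helper, and the grid sizes are all hypothetical. The function brackets a kernel launch, and optionally a cudaFree of the buffer the kernel writes to, with two events and returns the elapsed time in milliseconds, or a negative value if any call reports an error.

```cuda
#include <cuda_runtime.h>

// Hypothetical busy kernel; the store keeps the loop from being optimized away.
__global__ void spin(float* out, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)idx;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    out[idx] = v;
}

// Elapsed ms between events around a launch, with or without a cudaFree of the
// kernel's output buffer right after the launch. Negative result means error.
float timeLaunch(bool freeAfterLaunch, int iters)
{
    const int blocks = 64, threads = 128;
    float* d_out = nullptr;
    if (cudaMalloc((void**)&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaError_t freeErr = cudaSuccess;
    cudaEventRecord(start, 0);
    spin<<<blocks, threads>>>(d_out, iters);
    if (freeAfterLaunch) {
        freeErr = cudaFree(d_out);   // issued while the kernel may still be running
        d_out = nullptr;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (d_out) cudaFree(d_out);
    return (err == cudaSuccess && freeErr == cudaSuccess) ? ms : -1.0f;
}
```

Calling `timeLaunch(false, n)` and `timeLaunch(true, n)` with the same `n` and comparing the two results mirrors the comparison in the post; the absolute numbers depend entirely on the GPU and on `n`.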

Josh Anderson flips out every time you say this because it’s blatantly wrong, so I guess it’s my turn. CUDA inserts synchronization wherever it needs it. Two kernels in a row? Implicit sync barrier. cudaMemcpy immediately after a kernel? Implicit sync and can’t overlap because it’s in stream 0. cudaMemcpyAsync? Gotta do it yourself because maybe you want to start copying immediately–it’s async and sometimes there is overlap so you can do that! The time where you would need to insert some sort of manual synchronization primitive is basically whenever you are using zero-copy or async, but when things can’t happen simultaneously, you don’t need to insert a friggin synchronization barrier. You don’t need to insert a cudaThreadSynchronize between each kernel launch (because this would be stupid), you don’t need to insert a cudaThreadSynchronize between a kernel and a memcpy (because this would be stupid), and you don’t need to insert a cudaThreadSynchronize between any memory management function and a kernel. (I don’t even think you have to with zero-copy, but I haven’t tried it to be sure)

Done ranting.
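
As a small illustration of the kernel-then-cudaMemcpy case from the post above, here is a sketch in which a blocking copy on the default stream directly follows a kernel launch, with no explicit synchronization call in between. `fillIndex` and `kernelThenMemcpy` are hypothetical names used only for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread writes its own global index.
__global__ void fillIndex(int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Returns true if the data copied back matches what the kernel wrote.
bool kernelThenMemcpy(int n)
{
    int* d_out = nullptr;
    if (cudaMalloc((void**)&d_out, n * sizeof(int)) != cudaSuccess)
        return false;

    fillIndex<<<(n + 255) / 256, 256>>>(d_out, n);

    // No explicit synchronize here: a blocking cudaMemcpy issued on the
    // default stream is ordered after the kernel launched just before it.
    std::vector<int> host(n, -1);
    cudaError_t err = cudaMemcpy(host.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != i) return false;
    return true;
}
```

The asynchronous variants mentioned in the post (cudaMemcpyAsync on non-default streams) are the cases where the ordering has to be arranged by the caller; this sketch does not cover them.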

robOn · May 13, 2009, 8:54am

Hi to everybody (my first post),

I’m indeed trying to loop kernel calls:

for(int i=0; i<someSize; i++)

{

someKernel_1<<< x, y >>> (i, some args);

cudaThreadSynchronize();

}

If I don’t call cudaThreadSynchronize() after each kernel launch, I get the dreaded Windows bluescreen and my computer reboots. But if I put a second, different kernel into the loop, like …

for(int i=0; i<someSize; i++)

{

someKernel_1 <<< x, y >>> (i, some args);

someKernel_2 <<< x, y >>> (i, some args);

}

… then I don’t need to call cudaThreadSynchronize(). I don’t really understand why. Where is the implicit sync barrier?

Thanks in advance, Robert
