unspecified launch failure: random freezes lasting ~2 seconds

Hello everybody,

I’m currently accelerating some image processing operations with CUDA, running a function that makes 3 kernel calls. Each call takes about 600 ms on really big images, so I should not have a problem with the 5 s limit.

I’m currently using one image all the time.
Most of the time everything works fine, but sometimes the program throws “unspecified launch failure” and I get a corrupted result image.
(Again: a few seconds earlier the same image worked fine, and on the next call it usually works again. The crashes seem to be really random.)

Just an empirical observation: the bigger the images are, the more often execution fails.

The position in the image where these errors begin to appear also seems to vary randomly.

I really have no idea what’s going wrong.

In the failing cases, program execution also takes about two seconds longer.

Thanks for your help,
xlro.

Might this be a problem with Windows? I only have one GeForce 8800 GTS graphics card.

There is no second card for the operating system.

Of course I synchronize via

CUDA_SAFE_CALL( cudaThreadSynchronize() );

after each kernel function call.
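For reference, here is a minimal sketch of that launch-then-check pattern. The macro `CHECK_CUDA` and the kernel `invert` are hypothetical stand-ins (for `CUDA_SAFE_CALL` and for one of the three image kernels), and `cudaDeviceSynchronize()` is used because it is the current name for the deprecated `cudaThreadSynchronize()`. An “unspecified launch failure” is raised while the kernel runs, so it shows up at the synchronize call rather than at the launch itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): abort with file/line
// on any CUDA runtime error.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Placeholder kernel standing in for one image-processing step.
__global__ void invert(unsigned char *img, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];
}

int main()
{
    const int n = 1 << 20;  // number of pixels, chosen arbitrarily
    unsigned char *d_img = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d_img, n));
    CHECK_CUDA(cudaMemset(d_img, 0, n));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    invert<<<blocks, threads>>>(d_img, n);

    CHECK_CUDA(cudaGetLastError());       // bad launch configuration etc.
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaFree(d_img));
    printf("kernel completed without error\n");
    return 0;
}
```

Checking both calls separately helps tell a launch that never started apart from a kernel that crashed partway through.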

I’ve seen this behavior many times. Things I know that cause it:

  1. Writing past the end of an array in GPU memory. Check carefully for these cases, or run a small test case through valgrind (Linux only) with your code compiled in GPU emulation mode.

  2. A CUDA bug. Here is a previous discussion of it: http://forums.nvidia.com/index.php?showtopic=59188
    I trigger the bug the same way you do: running the same kernel on the same data over and over again. It randomly fails with unspecified launch failure every ~10,000 calls. It is very subtle and can sometimes disappear with the slightest change to the kernel code.
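As an illustration of cause 1, here is a hedged sketch of a bounds-guarded image kernel. The kernel name `threshold`, the image size, and the block size are all assumptions made up for this example, not taken from the original code. The point is that when the image dimensions are not multiples of the block size, the grid has to be rounded up, and the extra threads must return before touching memory. On recent toolkits, where emulation mode is no longer available, `compute-sanitizer` can catch the same kind of out-of-bounds write.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel; `pitch` is the row stride in bytes.
__global__ void threshold(unsigned char *img, int width, int height,
                          size_t pitch, unsigned char t)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard: threads in the rounded-up part of the grid do nothing.
    if (x >= width || y >= height) return;

    unsigned char *row = img + y * pitch;
    row[x] = row[x] > t ? 255 : 0;
}

int main()
{
    // Deliberately not multiples of the 16x16 block size.
    const int width = 1001, height = 701;

    unsigned char *d_img = NULL;
    size_t pitch = 0;
    if (cudaMallocPitch((void **)&d_img, &pitch, width, height) != cudaSuccess)
        return 1;
    cudaMemset2D(d_img, pitch, 200, width, height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,    // round up in x
              (height + block.y - 1) / block.y);  // round up in y
    threshold<<<grid, block>>>(d_img, width, height, pitch, 128);

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d_img);
    return err == cudaSuccess ? 0 : 1;
}
```

Removing the guard makes the edge threads write outside the image. Depending on what lies past the end of the allocation, that may go unnoticed or fail only occasionally, which would fit failures that become more frequent as images grow.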

Also, you don’t need to synchronize after every kernel call unless you are doing wall-clock benchmark timing. The driver inserts implicit syncs where they are needed (i.e. when you read memory back to the host).
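A small sketch of that implicit synchronization, assuming everything runs on the default stream. The kernel `fill` and the buffer size are hypothetical. The launch returns immediately, and the blocking `cudaMemcpy` that follows waits for the kernel to finish. It also returns any error the kernel raised, so a failed kernel is still detected without an explicit sync.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes each element's own index.
__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const int n = 1024;
    int h_out[n];
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    // Asynchronous: control returns to the host before the kernel finishes.
    fill<<<(n + 255) / 256, 256>>>(d_out, n);

    // Blocking copy on the default stream: waits for the kernel above and
    // reports any error it raised.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    printf("h_out[%d] = %d\n", n - 1, h_out[n - 1]);
    cudaFree(d_out);
    return 0;
}
```

While hunting a bug like this one, an explicit sync after each kernel is still useful, because it pins the error to the kernel that caused it instead of to a later copy.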