Unspecified launch failure: kernel fails sometimes, not every time

Hi All,

First of all, after months of work I am able to run my iterations on the GPU.
No doubt the result is very impressive.
But when I go ahead and try to increase the number of iterations, the kernel fails with the message “unspecified launch failure”.
And surprisingly, it sometimes launches successfully for the same number of iterations.

I must say that my kernel program is very bulky, but it is not violating CUDA restrictions such as register limits.
I searched the forums here and got no firm answer.
It is also not a problem of the XP watchdog, as it fails within just a few milliseconds.
Please let me know if there is any way I can find the exact reason why CUDA is behaving in such an unprofessional manner.
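One way to narrow down the exact reason is to check the return codes from the runtime after every launch. A launch failure is reported in two places: `cudaGetLastError()` right after the launch catches configuration errors, and a synchronize call afterwards catches errors (such as "unspecified launch failure") that only surface while the kernel runs. A minimal sketch, assuming a placeholder kernel `myKernel` and launch configuration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hedged sketch: myKernel and the launch configuration are placeholders,
// not the original code. The macro reports file/line for any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
        }                                                             \
    } while (0)

__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = (float)i;
}

int main()
{
    float *d_data = 0;
    CUDA_CHECK(cudaMalloc((void **)&d_data, 1024 * sizeof(float)));

    myKernel<<<4, 256>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaThreadSynchronize());  // catches execution errors such as
                                          // "unspecified launch failure"
    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Checking after the synchronize is the important part: kernel launches are asynchronous, so the "unspecified launch failure" often only becomes visible at the next synchronizing call.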


One more piece of information I wanted to pass along -

Earlier I was working with global memory.

But my threads do not access the input data in a “one to one” manner, which CUDA compute capability 1.0 devices expect.

As I read in the docs, there are restrictions on global memory access for devices with CUDA compute capability minor version 0.

Since I have a 1.0 device, I later switched from global memory to constant memory.

My observation now is that it fails less often than in the global memory case.

But it still fails to launch now and then.
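For reference, the global-to-constant switch described above can be sketched like this (the names `N`, `c_input`, and `iterate` are made up for illustration; the real constraint is the 64 KB constant-memory limit):

```cuda
#include <cuda_runtime.h>

// Hedged sketch of moving read-only input into constant memory.
// All identifiers here are illustrative, not from the original code.
#define N 4096                            // must fit within 64 KB total

__constant__ float c_input[N];            // read-only from the kernel

__global__ void iterate(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = c_input[i] * 2.0f;       // served by the constant cache
}

int main()
{
    float h_input[N], *d_out = 0;
    for (int i = 0; i < N; ++i)
        h_input[i] = (float)i;

    // Constant memory is filled with cudaMemcpyToSymbol, not cudaMemcpy.
    cudaMemcpyToSymbol(c_input, h_input, sizeof(h_input));

    cudaMalloc((void **)&d_out, N * sizeof(float));
    iterate<<<N / 256, 256>>>(d_out);
    cudaThreadSynchronize();
    cudaFree(d_out);
    return 0;
}
```

One caveat worth knowing: the constant cache is fastest when all threads of a half-warp read the same address; per-thread scattered reads from constant memory serialize, but they are still legal.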



Probably an out-of-bounds memory access somewhere. Unspecified launch errors are the GPU equivalent of access violations or segfaults.
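The most common form of that bug is worth spelling out: the grid is usually rounded up to a whole number of blocks, and the surplus threads in the last block index past the allocation unless they are explicitly guarded. A sketch with made-up names:

```cuda
#include <cuda_runtime.h>

// Hedged sketch of the classic out-of-bounds pattern. "scale" and the
// sizes are illustrative; the point is the bounds guard on the index.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // without this guard, threads with i >= n
        data[i] *= 2.0f;     // write past the end of the allocation
}

int main()
{
    const int n = 1000;      // deliberately not a multiple of 256
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    int blocks = (n + 255) / 256;        // rounds up to 4 blocks = 1024 threads
    scale<<<blocks, 256>>>(d_data, n);   // the extra 24 threads rely on the guard
    cudaThreadSynchronize();
    cudaFree(d_data);
    return 0;
}
```

An intermittent failure fits this pattern too: whether a stray write actually crashes the kernel can depend on where the allocation happens to land in device memory from run to run.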

Well, I am rechecking my entire code.

It may take me some time to find whether I am doing any segfaults, etc.!

But I wanted to update you that for the same iterations and the same set of input data, my kernel was running successfully “sometimes”, with proper outputs.

Can there be any other reason for this type of launch failure?

Yes, another possibility might be bad hardware. But I would verify the code first. Try something like valgrind or GPU Ocelot if you can. Ocelot, in particular, is fantastic for isolating improper memory use.

Having said that, hardware can cause what you are seeing. I had one particular 9500GT DDR3 card that worked perfectly until you pushed it past about 75% of peak memory bandwidth, at which point it started behaving very erratically, including random launch failures, driver errors, and video RAM corruption. Even in standard OpenGL benchmarks it would run happily for hours, but my CUDA code could make it start failing in minutes. Emulation with valgrind, Ocelot, and cuda-gdb never found a bug in the code, and I was able to run it happily on other hardware. At the suggestion of someone here I tried underclocking, and it helped a bit, but in the end I put it down to bad hardware and gave up on it.

I have been digging through the code for the last 12 hours.

I have observed a few things -

  • I have not been able to find any segfault “till now”.

  • The “same set of iterations” keeps running and failing intermittently.

  • I get “correct results from every thread” when the kernel launches successfully.

  • I am able to go beyond the limit when I access the data linearly from constant memory, but it often fails when I access the data in a haphazard way from constant memory.

avidday, I have yet to use the tools you mentioned. I am trying them now.

I will update here once I confirm.

Anyway, as I mentioned, for the same input set it sometimes launches successfully with the proper output that I would have got if I had run it on the CPU.

But if I access the data linearly, it works fine :) and I can go beyond the limit.

The problem arises when I access the input from here and there.

As I read in the docs, CUDA compute capability 1.0 has restrictions on this; global memory cannot be accessed in such a fashion.

So I switched to constant memory. Anyway, can this be a reason?
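For what it's worth, the coalescing rules on compute capability 1.0 are a performance restriction, not a correctness one: an uncoalesced global read is split into separate memory transactions and runs slowly, but by itself it should not cause a launch failure, unless the scattered index also lands out of bounds. A hedged sketch contrasting the two patterns (all names are illustrative):

```cuda
#include <cuda_runtime.h>

// Linear access: thread k reads word k, which coalesces on 1.0 hardware.
__global__ void copyLinear(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Scattered access: an arbitrary gather. Slow on 1.0 hardware, but legal.
// An out-of-range index j here, not the scattering itself, would crash.
__global__ void copyScattered(const float *in, float *out,
                              const int *index, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = index[i];
        if (j >= 0 && j < n)      // validate the gathered index before use
            out[i] = in[j];
    }
}

int main()
{
    const int n = 1024;
    float *d_in = 0, *d_out = 0;
    int *d_index = 0;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMalloc((void **)&d_index, n * sizeof(int));

    copyLinear<<<n / 256, 256>>>(d_in, d_out, n);
    copyScattered<<<n / 256, 256>>>(d_in, d_out, d_index, n);
    cudaThreadSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_index);
    return 0;
}
```

So the symptom you describe (linear works, haphazard fails) points more towards the gather index occasionally going out of range than towards the access pattern itself being forbidden.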

But avidday, I want to thank you for all your help.



I mean to say: can it be a reason for “unspecified launch failure” if I disobey the rule mentioned in the attachment?

Please find the attachment.

Please let me know if my English is unclear.
GPU.bmp (1.49 MB)