First of all, after months of work I am able to run my iterations on the GPU.
No doubt the results are very impressive.
But when I try to increase the number of iterations, the kernel fails with the error “unspecified launch failure”.
Surprisingly, it sometimes launches successfully for the same number of iterations.
I must say my kernel is quite bulky, but it does not exceed CUDA limits such as the register count.
I searched the forums here and found no definitive answer.
It is not the XP watchdog timer either, since the kernel fails within just a few milliseconds.
Please let me know if there is any way to find out the exact reason why CUDA is failing like this.
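To show what I am doing, here is a stripped-down sketch of how I check for the exact error code after a launch (this assumes the CUDA runtime API; `myKernel` is a made-up placeholder, not my real kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort with the CUDA error string if a runtime call failed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            return 1;                                                 \
        }                                                             \
    } while (0)

__global__ void myKernel(float *out)   // placeholder kernel
{
    out[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    float *d_out;
    CUDA_CHECK(cudaMalloc(&d_out, 256 * sizeof(float)));

    myKernel<<<1, 256>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches execution-time errors,
                                          // e.g. "unspecified launch failure"

    CUDA_CHECK(cudaFree(d_out));
    printf("ok\n");
    return 0;
}
```

Checking both `cudaGetLastError()` right after the launch and the result of the synchronize separates configuration errors from errors that happen while the kernel is running.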
Yes, another possibility might be bad hardware. But I would verify the code first. Try something like valgrind or GPU Ocelot if you can. Ocelot, in particular, is fantastic for isolating improper memory use.
Having said that, hardware can cause what you are seeing. I had one particular 9500GT DDR3 card that worked perfectly until you pushed it past about 75% of peak memory bandwidth, at which point it started behaving very erratically, including random launch failures, driver errors, and video RAM corruption. Even in standard OpenGL benchmarks it would run happily for hours, but my CUDA code could make it start failing in minutes. Emulation with valgrind, Ocelot, and cuda-gdb never found a bug in the code, and I was able to run it happily on other hardware. At the suggestion of someone here, I tried underclocking, and it helped a bit, but in the end I put it down to bad hardware and gave up on it.
I have not been able to check for a segfault so far.
The same set of iterations runs and fails intermittently.
I get the correct result from each thread when the kernel launches successfully.
I am able to go beyond the limit when I access the data linearly from constant memory, but it often fails when I access the constant memory in a haphazard (scattered) way.
avidday, I have yet to use the tools you mentioned; I am trying them now.
I will update here once I confirm.
Anyway, as I mentioned, for the same set it sometimes launches successfully with the correct output, matching what I would have got running it on the CPU.
But if I access the data linearly it works fine :) and I can go beyond the limit.
The problem is when I access the input from scattered locations.
As i read in the docs CUDA compitability 1.0 has restrictions on this. Global memory cant be accessed is such a fashion.
So i switched to constant memory. Anyway Can this be a reason?
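To illustrate what I mean by linear versus haphazard access, here is a simplified sketch (the kernel and array names are made up, not my actual code):

```cuda
#include <cuda_runtime.h>

#define N 256

__constant__ float c_data[N];

// "Linear" pattern: in each loop iteration every thread in a warp reads
// the SAME constant-memory address, which the hardware broadcasts cheaply.
__global__ void linearRead(float *out)
{
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += c_data[i];
    out[threadIdx.x] = sum;
}

// "Haphazard" pattern: threads in a warp read DIFFERENT constant-memory
// addresses in the same iteration; on compute capability 1.x these reads
// are serialized instead of broadcast, so this is much slower.
__global__ void scatteredRead(const int *idx, float *out)
{
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += c_data[idx[(threadIdx.x + i) % N]];
    out[threadIdx.x] = sum;
}
```

As far as I understand, divergent constant-memory reads within a warp are serialized on compute capability 1.x, so the scattered pattern costs performance, but I am not sure whether that alone explains a launch failure.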