I have an annoying problem that I don’t really know how to solve. I hope someone here has had similar problems or can give me a hint in the right direction.
I have a computer with one GeForce GTX 260 and one GeForce GTX 295 in it, so I have three devices to run on (the GTX 295 is a dual-GPU card). I run three separate instances of my program, one per device. Each instance has its own compiled code with some parameters set to different values. My program is GPU-intensive, and the machine has one Intel quad-core CPU serving the GPUs, so I don’t think the CPU should be a bottleneck. My problem is that the runs crash after a while. The time before a crash varies a lot, from a couple of minutes to hours. When the runs “crash” they seem to freeze, and the problem always seems to happen when at least one of the runs hits a “4 (unspecified launch failure)”.
Another weird thing is that when I run ‘top’ I see 3 java processes sitting at 100 % each (the CPU has 4 cores, so that is possible), but Java shouldn’t be doing anything at all at that point. Parts of the program I’m using run in Java and “communicate” with the C code through JNI, and the C code (CUDA code) is responsible for launching the kernels and so on. I would expect the Java code to sit at basically 0 %, just waiting for the CUDA code to return its result, but for some reason it sits at 100 %.
If I run on only one device and leave the other two idle, I don’t get any “freeze”. I’ve had a run going for about 5 days on another computer without it freezing, at least. It is possible, though, that some “4 (unspecified launch failure)” errors occurred during that time, because whenever a run crashes or finishes a new run is started (everything is driven from a script). But those crashes can’t be frequent: I write some statistics to file when a run completes, and a run that crashed before that point would not have written anything. A lot of statistics have been written there.
I find it very hard to track down the error causing this behaviour because it only happens sometimes, and the failure message “4 (unspecified launch failure)” doesn’t tell me anything. On top of that, Java sits at 100 % when the crash happens on the GPU (at least I assume that is where it happens).
I’m using CUDA 2.3, but I’ve had similar problems with earlier releases.
I’m not posting any code right now because I don’t know how much I would need to post, and I don’t want to clutter the message if it isn’t necessary. I’m more interested in general ideas about what might be causing this, or in hearing from anyone who has experienced similar problems.