I have run into a very annoying problem:
We have access to a Tesla GPU cluster. However, sometimes the GPUs on some nodes get stuck, as if the GPU kernel had gone into an infinite loop. When I access the node interactively, the cursor just blinks without producing any output. When I submit the job with qsub, it always gets killed because it exceeds the requested time. I have tried the sample projects in the CUDA SDK and still get the same behavior. The strange thing is that this happens to the GPUs randomly: sometimes a GPU executes the kernel successfully, and sometimes it gets stuck.
Has anyone here run into a similar problem? I suspect it might be a heat or power issue (for example, the GPU gets stuck when it is too hot).
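
To narrow down which nodes hang, one option might be to wrap a known-good test program in a watchdog so a stuck run is reported quickly instead of waiting for the scheduler to kill the job. Below is a minimal Python sketch of that idea. The helper name `run_with_timeout` is made up for illustration, and the command you pass in (for example an SDK sample binary) is an assumption about your setup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the result.

    Returns (status, stdout) where status is 'ok', 'failed', or 'hung'.
    Hypothetical helper for illustration, not part of any CUDA tooling.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child did not finish within timeout_s seconds.
        return "hung", ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout.strip()


# Example usage (the binary path is an assumption about the local install):
# status, out = run_with_timeout(["./deviceQuery"], 60)
# print(status)
```

Running this on each node a few times and logging the status next to the output of `nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv,noheader` might show whether the hangs line up with high temperature or power draw. One caveat: if the process is stuck inside the driver and cannot be killed, the watchdog itself may block while cleaning up, so this is only a rough first check.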