Running GPU kernels on a GPU cluster


I have run into a very annoying problem:

We have access to a Tesla GPU cluster. However, the GPUs on some nodes sometimes get stuck. It acts as if the GPU kernel has gone into an infinite loop: when I access the node interactively, the cursor just blinks with no output. When I submit the job with qsub, it always gets killed because it exceeds the requested wall time. I have tried the sample projects from the CUDA SDK but still see the same behavior. The strange thing is that this happens to the GPUs at random: sometimes a GPU executes the kernel successfully, and sometimes it gets stuck.
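One way to tell a genuinely hung kernel from one that is just slow is to wrap the test binary in a watchdog that kills it after a generous timeout. A minimal sketch in Python (the `sleep` command below is just a stand-in for whichever CUDA SDK sample you are running):

```python
import subprocess

def run_with_watchdog(cmd, timeout_s):
    """Run cmd, killing it if it runs longer than timeout_s seconds.

    Returns (hung, returncode, stdout): hung is True when the process
    had to be killed, i.e. it looks like a stuck kernel.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return False, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return True, None, ""

# Stand-in for a CUDA sample binary: finishes immediately, so not "hung".
hung, rc, out = run_with_watchdog(["sleep", "0"], timeout_s=5)
```

Running each node's GPUs through a wrapper like this (with a timeout a few times the sample's normal runtime) would let you log which nodes hang and how often, which is useful evidence to hand the cluster staff.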

Has anyone here run into a similar problem? I suspect it might be a heat or power issue (e.g., when the GPU gets too hot, it hangs).
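If heat or power is the suspect, and assuming the nodes have `nvidia-smi` installed, you could log temperature and power draw in a second session while the kernel runs and see whether hangs correlate with high readings (these are the standard `nvidia-smi` query fields; adjust the polling interval to taste):

```shell
# Poll GPU temperature (deg C) and power draw (W) once per second;
# run this alongside the kernel and watch for spikes before a hang.
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw \
           --format=csv -l 1
```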

You certainly need to get the administrator of that cluster involved in this investigation.

Yes, we have contacted the staff but have not gotten any response yet. I just want to know whether others have run into the same situation before.

What kind of cluster is it?

Can you give more details?