Program running on K80 is interrupted for no apparent reason

My program runs on a K80 with OpenMPI. While it is running, memory usage is below 30% (both CPU and GPU), the K80 temperature stays under 50 degrees C (as reported by nvidia-smi), and there is plenty of hard disk space. The program gets through a few loops and is then interrupted, at a different point each run, with no error information displayed. I have tried several CUDA versions, such as 5.0, 5.5, and 9.1; each time I change CUDA, I restart the cluster and recompile the program. The same program runs successfully on an M2090 with CUDA 5.0, a K10 with CUDA 5.0, and a K20 with CUDA 6.5.
Does anyone know how to fix it?

I am afraid that so far, there is too little information given to even determine a failure mode, let alone tell you what is happening and why.

I have no idea what “the progress of interruption is not the same” means. How exactly does the error manifest itself? Does the program terminate abnormally, without an error message? Have you tried augmenting the application with a rudimentary debug logging feature, so you can see at which point in the application execution stops?
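As a minimal sketch of such a logging feature (assuming a C/C++ MPI+CUDA code; the macro name CHECKPOINT is just a placeholder), something along these lines could be sprinkled through the main loop:

    #include <cstdio>
    #include <mpi.h>

    // Print the MPI rank, source file, and line number, and flush
    // immediately so the message is not lost if the process dies.
    #define CHECKPOINT()                                           \
        do {                                                       \
            int rank_ = -1;                                        \
            MPI_Comm_rank(MPI_COMM_WORLD, &rank_);                 \
            fprintf(stderr, "[rank %d] reached %s:%d\n",           \
                    rank_, __FILE__, __LINE__);                    \
            fflush(stderr);                                        \
        } while (0)

The last checkpoint printed by the rank that disappears tells you roughly where execution stopped.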

It is possible for programs to contain bugs that do not manifest on particular hardware, or with older CUDA versions whose compilers do not optimize as aggressively. Are any issues reported when you run the code under control of cuda-memcheck?
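For what it is worth, cuda-memcheck can be run under mpirun by prefixing the application command; assuming an executable named ./myapp (a placeholder) launched with four ranks, the invocation would look roughly like:

    mpirun -np 4 cuda-memcheck ./myapp

Each rank writes its own report to stdout/stderr, so redirecting or tagging the output per rank makes the results easier to read.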

Can you share the code for the program with us, or is it quite large or proprietary?

Thank you for your advice. The program DOES terminate abnormally without an error message. The only message is from MPI: “mpirun has exited due to process rank 3 with PID 10155 on node …”. The code is large and proprietary, so I’m sorry I can’t share all of it, but I’ll try your advice to locate the point where the program terminates. Then I can share the part of the code around that point.

In that case, I doubt much can be done in terms of remote diagnosis. Use standard debugging techniques to narrow down the immediate cause of the termination, then track backwards from there to find the root cause. My best guess is you are looking at a race condition of some kind.
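One such standard technique is to check the status returned by every CUDA API call and kernel launch, since an error from an earlier asynchronous kernel often only surfaces at a later call. A minimal sketch (the macro name CUDA_CHECK is just an example) might look like:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Abort with the error string, file, and line if a CUDA call fails.
    #define CUDA_CHECK(call)                                          \
        do {                                                          \
            cudaError_t err_ = (call);                                \
            if (err_ != cudaSuccess) {                                \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                        cudaGetErrorString(err_), __FILE__, __LINE__);\
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    // After a kernel launch (myKernel is hypothetical):
    //   myKernel<<<grid, block>>>(args);
    //   CUDA_CHECK(cudaGetLastError());
    //   CUDA_CHECK(cudaDeviceSynchronize());

With this in place, an abnormal termination should at least be accompanied by a message that narrows down the failing call.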

I assume this is a cluster of Linux boxes you are running on, and all systems are headless, so we wouldn’t expect trouble due to kernels being killed by watchdog timer events. I also assume you have established that the GPUs in the cluster work properly when running simple test apps on each of the individual machines.

If you can get assistance from someone local to you who can also look at the code, that would probably be the best kind of assistance. Diagnosing problems in unseen code over the internet is a bit like a car mechanic diagnosing car trouble over the phone without access to the car.