Program killed

When I ran a Python program on a Tesla M4, I encountered the problem of it being “killed”.
The running environment is:

Tesla M4 4G
OS: Ubuntu 16.04 Linux server
Driver Version: 440.64.00
CUDA Version: 10.2

The program only consumes 500M/4G of GPU memory and 2G/126G of RAM.
CPU load is around 26/48. In other words, the server has 126G of RAM and 48 logical CPUs.

Everything looks OK. However, it gets killed after about half an hour.

I checked with “sudo fuser -v /dev/nvidia*”; there are no other programs.

I also set Option “Interactive” “0” in /etc/X11/xorg.conf.

Thanks a lot.

“killed” is not sufficiently specific information to deduce anything from it. Killed how? Killed by whom?

Did the application terminate itself abnormally (“ABEND”) or was the process killed by an outside agent? If you cannot tell, you might want to add a status log to the app and also monitor signals. If you have not done so yet, add rigorous checking of the return status of every CUDA API call and every kernel launch.
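As a rough illustration of the status log and signal monitoring suggested above, here is a minimal Python sketch. The logger name "status" and the function names are made up for this example, and the choice of signals is an assumption. Note that SIGKILL cannot be caught by any handler, so an empty log at the time of death is itself a hint.

```python
import atexit
import logging
import os
import signal

# "status" is a hypothetical logger name; attach whatever handlers you prefer.
log = logging.getLogger("status")


def describe_signal(signum):
    """Return the symbolic name of a signal number, e.g. 15 -> 'SIGTERM'."""
    return signal.Signals(signum).name


def log_signal(signum, frame):
    """Log the received signal, then re-deliver it with its default action."""
    log.error("pid %d received %s", os.getpid(), describe_signal(signum))
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)


def install_status_logging(signums=(signal.SIGTERM, signal.SIGINT, signal.SIGHUP)):
    """Install log_signal for the given signals and log when atexit is reached."""
    for signum in signums:
        signal.signal(signum, log_signal)
    # SIGKILL and SIGSTOP cannot be caught, so they are deliberately absent.
    atexit.register(log.info, "atexit reached; not terminated by an uncaught signal")
```

Calling install_status_logging() near the top of the main script, after configuring logging to write to a file, should be enough to see whether a catchable signal arrives before the process disappears.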

Anything interesting reported by dmesg? Are you running on a physical box, or is this a virtual system? Is the application running free-standing on top of the OS, or under control of some other software (e.g. in a queue of a scheduler like LSF) that imposes strict limits on run time?
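If the dmesg output is long, a small filter can help. The sketch below scans kernel log lines for substrings that commonly show up when the kernel's OOM killer acts or when the NVIDIA driver reports an Xid error. The pattern list is an assumption and not exhaustive, since the exact wording varies between kernel and driver versions; the function names are made up for this example.

```python
import re
import subprocess

# Substrings commonly seen in kernel logs for OOM kills and NVIDIA Xid errors.
# This list is an assumption; exact wording depends on kernel/driver version.
SUSPICIOUS = re.compile(r"oom-killer|Out of memory|Killed process|NVRM: Xid")


def find_suspicious(lines):
    """Return the log lines that match any of the suspicious patterns."""
    return [line for line in lines if SUSPICIOUS.search(line)]


def read_kernel_log():
    """Return dmesg output as a list of lines (may require root)."""
    result = subprocess.run(
        ["dmesg"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return result.stdout.splitlines()
```

Running find_suspicious(read_kernel_log()) right after the program dies, and getting an empty list, would make an OOM kill or a logged driver fault less likely, though it does not rule out other causes.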

Thank you very much!

This program is written with PyTorch and runs on a physical box,
free-standing on top of the OS. Nothing interesting is reported by dmesg. I cannot even find a “killed” message in dmesg.

Also, it is source code downloaded from GitHub, and it runs fine elsewhere. I also noticed that if nvidia-docker is started, even when no program is running on top of nvidia-docker, this PyTorch program is killed after a couple of minutes.

Since it sounds like you are sitting in front of the physical system, you would want to add observability measures (e.g. a log, diagnostic messages) to the software to narrow down how and where it dies.

Is other software sending it a signal? Is it terminating itself due to an internal error condition? Maybe something can be learned from observing ps while the app is running. Or try a tracing utility like strace to see what the app is doing.
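One way to answer the "signal or internal error" question without touching the application is to launch it from a small wrapper and inspect how it ended. A minimal sketch, with made-up function names: on POSIX systems, subprocess reports a negative return code -N when the child was terminated by signal N. A shell that prints just "Killed" usually means SIGKILL.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number without a symbolic name in this Python build.
            name = "signal {}".format(-returncode)
        return "killed by signal {}".format(name)
    if returncode == 0:
        return "exited normally (status 0)"
    return "exited with error status {}".format(returncode)


def run_and_report(argv):
    """Run argv as a child process and report how it terminated."""
    completed = subprocess.run(argv)
    report = describe_exit(completed.returncode)
    print(report, file=sys.stderr)
    return report
```

For example, run_and_report([sys.executable, "train.py"]), where train.py stands in for the actual script. This tells you which signal ended the process, not who sent it; for that, a tracing or auditing tool is still needed.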

Maybe you’ll find that the app always dies at a particular phase of processing, and you can drill down on that.
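To find out whether it always dies in the same phase, something as simple as the following sketch may be enough. The phase helper and the phase names in the usage note are made up for this example. Each message is flushed immediately so that the last line written survives even an abrupt kill.

```python
import contextlib
import sys
import time


@contextlib.contextmanager
def phase(name, stream=sys.stderr):
    """Write flushed start/end markers around a block of work."""
    start = time.monotonic()
    print("[phase] start {}".format(name), file=stream, flush=True)
    try:
        yield
    finally:
        # Runs on normal completion and on exceptions, but not if the process
        # is killed outright; then the last "start" line names the phase.
        elapsed = time.monotonic() - start
        print(
            "[phase] end {} after {:.1f}s".format(name, elapsed),
            file=stream,
            flush=True,
        )
```

Wrapping the major steps, for instance "with phase('load data'):" and "with phase('epoch 3'):", and redirecting stderr to a file should show where the output stops across several runs.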

If there is someone with more debugging experience that can help you, ask that person so they can sit with you in front of the application and the system to help you figure out what is going on. This is not really possible over the internet.