GPU runs extremely slowly and never returns to its earlier speed; nvidia-smi also stops showing one GPU

sekigh · May 25, 2021, 5:37am

I am facing a strange problem: my program, which uses one GPU (CUDA device number 1), runs tremendously slowly and never speeds up again. I would like to know the cause and how to work around it.

While this is happening, the entry for CUDA device 0 has disappeared completely from the nvidia-smi output; even the display frame for device 0 is gone (see 20210525_nvidia-bug-report.log.gz). Note that device 0 was idle the whole time.

I noticed this when I reconnected over ssh to the server after the program had been running for a few days. The program is definitely still making progress, but far more slowly than in the earlier stage of the same run. I believe the cause lies in GPU execution, because the program is still alive and keeps calling GPUtil in its loop, printing the GPU memory currently in use (see 20210525_My_program_coninues_to_run_in_a_loop_(printing_GPUtil)png).

I have attached nvidia-bug-report.log.gz and several screenshots. Thank you in advance for your help.
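To make the symptom easier to catch, here is a minimal sketch of a check that could run alongside the training loop. It is not what my program does today; the function names (`parse_gpu_rows`, `missing_gpus`, `query_gpus`) are hypothetical, and it assumes that `nvidia-smi` is on the PATH and that a surviving GPU keeps its original index in the listing, which may not hold on every system.

```python
import subprocess

# Standard nvidia-smi query flags: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse rows of 'index, name, used, total' into a dict keyed by GPU index.

    Lines that do not have four fields or whose numbers cannot be parsed
    (for example error messages or '[N/A]' values) are skipped.
    """
    gpus = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue
        index, name, used, total = fields
        try:
            gpus[int(index)] = {
                "name": name,
                "used_mib": int(used),
                "total_mib": int(total),
            }
        except ValueError:
            continue
    return gpus


def missing_gpus(gpus, expected_count):
    """Return the indices in range(expected_count) that nvidia-smi did not list."""
    return [i for i in range(expected_count) if i not in gpus]


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (raises if the call fails)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    EXPECTED_GPUS = 2  # assumption: this server has two GTX 1080 Ti cards
    gpus = query_gpus()
    for index, info in sorted(gpus.items()):
        print(f"GPU {index} ({info['name']}): "
              f"{info['used_mib']} / {info['total_mib']} MiB used")
    lost = missing_gpus(gpus, EXPECTED_GPUS)
    if lost:
        print(f"WARNING: GPUs not listed by nvidia-smi: {lost}")
```

Logging this with a timestamp every few minutes would at least show when device 0 drops out of the listing relative to when the slowdown starts.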

sekigh

My server profile:
Running on Docker: Docker version 19.03.5, build 633a0ea838
Python framework: PyTorch 1.8.0

Ubuntu version:
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="Data privacy | Ubuntu"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

GPUs: 2 x GeForce GTX 1080 Ti
CUDA version: Cuda compilation tools, release 9.2, V9.2.148
cuDNN version: 7.6.5

20210525_nvidia-bug-report.log.gz (1.4 MB)
