Environments and configurations:
- Ubuntu 18.04
- 4 RTX 2080 Ti GPUs were installed and in use early this year. Later, 2 GPUs were unplugged and the NVIDIA driver was reinstalled in an attempt to solve the system-freezing problem.
- NVIDIA driver version: 440.36
- CUDA version: 10.2
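For anyone comparing setups, the per-GPU details listed above (index, PCI bus ID, driver version, temperature, power draw) could be collected in one pass with `nvidia-smi --query-gpu`. The following is only a minimal sketch: the helper names `parse_gpu_csv` and `query_gpus` are hypothetical and were not part of the original troubleshooting.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = [
    "index",
    "pci.bus_id",
    "name",
    "driver_version",
    "temperature.gpu",
    "power.draw",
]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [value.strip() for value in record]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd, text=True))
```

Calling `query_gpus()` before and after swapping cards shows which physical card (by PCI bus ID) currently holds index 0, which helps tell a faulty card apart from a faulty slot.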
Problem and my observations:
- At the very beginning (January and February this year), everything worked fine with all four GPUs.
- Later, I found that when the first GPU (GPU index 0) was used for deep learning model training, the system would hang after several training epochs.
- Over time, the problem became more severe: the system now hangs immediately when training starts.
- We reinstalled the NVIDIA driver and unplugged the first GPU, but the GPU that then became index 0 shows the same problem. The remaining GPUs are stable. Temperature and power supply are normal.
- Other attempts that did not solve the problem:
  - Using different CUDA versions: 10.0, 10.1, and 10.2.
  - Trying different workloads (PyTorch, gpu-burn).
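The NVIDIA kernel driver reports GPU errors in the kernel log as `NVRM: Xid` lines, so one way to narrow down a hang like this is to count Xid events per PCI device in a saved log. This is a minimal sketch, assuming the log was written to disk before the freeze (a hard hang may leave nothing behind); the function names are hypothetical, and the sample lines in the usage note are illustrative rather than taken from the attached report.

```python
import re
from collections import Counter

# Matches driver messages such as "NVRM: Xid (PCI:0000:1a:00): 79, ..."
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>PCI:[0-9A-Fa-f:.]+)\): (?P<code>\d+)"
)


def find_xid_events(log_text):
    """Return a list of (pci_device, xid_code) tuples found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


def count_xid_events(log_text):
    """Count how often each (pci_device, xid_code) pair occurs."""
    return Counter(find_xid_events(log_text))
```

For example, `count_xid_events(open("nvidia-bug-report.log", errors="replace").read())` would tally the events in the attached report. If all events point to the same PCI address regardless of which card is installed there, the slot, riser, or cabling is a more likely suspect than the card itself.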
Attachment:
nvidia-bug-report.log (2.5 MB)