Ubuntu 18.04 freezes when running gpu-burn on RTX 2080 Ti

Environments and configurations:

  • Ubuntu 18.04
  • Four RTX 2080 Ti GPUs were installed and in use early this year. Later, two GPUs were unplugged and the NVIDIA driver was reinstalled in attempts to solve the system-freezing problem.
  • NVIDIA driver version: 440.36
  • CUDA version: 10.2

Problem and my observation:

  • At the very beginning (January and February this year), everything worked fine with all four GPUs.
  • Later, I found that when training a deep learning model on the first GPU (GPU index 0), the system would hang after several training epochs.
  • Over time the problem became more severe, and the system now hangs immediately when training starts.
  • We tried reinstalling the NVIDIA driver and unplugging the first GPU, but the GPU that then became index 0 shows the same problem. The remaining GPUs are stable. Temperature and power supply are normal.
  • Other attempts that did not solve the problem:
    • using different CUDA versions: 10.0, 10.1, and 10.2.
    • running different workloads (PyTorch, gpu-burn).

Something is broken with your onboard Matrox graphics device:

mgag200 0000:03:00.0: Fatal error during GPU init: -6

so the X server keeps starting and stopping, hammering your first NVIDIA GPU. As a workaround, disable X from starting. In addition, nvidia-persistenced has to be started and kept running.
If the Matrox device is broken, either RMA the server/board, or disable it in the BIOS and add another VGA card.
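
To check whether another machine shows the same symptom, the kernel log can be scanned for the mgag200 init failure quoted above. The following is only a minimal sketch: the function name `find_mgag200_init_errors` and the regular expression are assumptions derived from that single log line, not part of any driver tooling.

```python
import re

# Assumed pattern, modeled on the one log line quoted in this thread:
#   mgag200 0000:03:00.0: Fatal error during GPU init: -6
MGAG200_INIT_ERROR = re.compile(
    r"mgag200 (?P<pci>[0-9a-fA-F:.]+): "
    r"Fatal error during GPU init: (?P<code>-?\d+)"
)


def find_mgag200_init_errors(log_text):
    """Return (pci_address, error_code) for each matching line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = MGAG200_INIT_ERROR.search(line)
        if match:
            hits.append((match.group("pci"), int(match.group("code"))))
    return hits


# Example with a sample string; on a real system the text would come from
# the kernel log instead.
sample_log = "mgag200 0000:03:00.0: Fatal error during GPU init: -6\n"
for pci, code in find_mgag200_init_errors(sample_log):
    print(f"mgag200 at {pci} failed GPU init with error {code}")
```

Many repeated hits in the log would be consistent with the X server restarting in a loop, though the log alone does not prove that this is what freezes the system.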