We recently purchased two workstations, each with 4 x RTX 2080 Ti GPUs.
We have been experiencing an issue since the first day. At some point after we start training our models with Kaldi, we get the following error, and nvidia-smi reports one of the GPUs as Err!
We have the same issue on both workstations. We have tried Ubuntu 16, 18, and 19 as well as CentOS 7.6, and at one point we even upgraded the kernel to 5.1.
We have installed different versions of CUDA and the GPU drivers. We also ran GPU burn and upgraded our BIOS to the latest version. None of this improved the situation.
The error occurs on all of the GPUs, but on only one GPU at a time. It is reproducible: sometimes it takes a few hours and sometimes over 24 hours, but it always appears eventually, and we have to restart the workstation to recover.
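Since the failure can take anywhere from a few hours to more than a day to show up, a small logger left running alongside training might help pin down when a GPU drops out and what it was doing just before. The following is only a rough sketch, not something from our actual setup: the function names, the log file name `gpu_watch.log`, and the 10-second interval are all hypothetical choices. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi and makes no assumption about what nvidia-smi prints once a GPU is in the Err! state; it simply records the return code and whatever text comes back.

```python
import csv
import io
import subprocess
import time
from datetime import datetime

# Fields requested from nvidia-smi; adjust as needed.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_gpu_rows(csv_text):
    """Parse headerless nvidia-smi CSV output into a list of dicts.

    Lines that do not have exactly one value per requested field
    (blank lines, error messages) are skipped.
    """
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != len(QUERY_FIELDS):
            continue
        rows.append(dict(zip(QUERY_FIELDS, (f.strip() for f in fields))))
    return rows


def poll_once():
    """Run nvidia-smi once and return (return_code, stdout, stderr)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        timeout=30,  # hypothetical limit in case nvidia-smi hangs
    )
    return result.returncode, result.stdout, result.stderr


def watch_gpus(interval_seconds=10, log_path="gpu_watch.log"):
    """Append one timestamped record per poll until interrupted."""
    with open(log_path, "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            try:
                code, out, err = poll_once()
                rows = parse_gpu_rows(out)
                # Keep both the parsed rows and the raw text, since the
                # raw text is what matters once a GPU misbehaves.
                log.write(f"{stamp} rc={code} rows={rows} "
                          f"stdout={out!r} stderr={err!r}\n")
            except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
                log.write(f"{stamp} nvidia-smi failed: {exc!r}\n")
            log.flush()  # make sure the last record survives a reboot
            time.sleep(interval_seconds)

# To start logging, call: watch_gpus()
```

Calling `watch_gpus()` in a separate terminal before starting training would leave a timestamped trail in the log file, so the last healthy readings (temperature, power draw, utilization) for the failing GPU could be compared across incidents.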
nvidia-bug-report.log.gz (1.96 MB)