We have recently purchased two workstations and both of them have 4 x RTX 2080 Ti GPUs attached.
We started experiencing an issue from day one. After we start training our models with Kaldi, we eventually get the following error, and nvidia-smi marks one of the GPUs as Err!
We have the same issue on both workstations. We tried Ubuntu versions 16, 18, and 19, as well as CentOS 7.6. We even upgraded the kernel to 5.1.
We have installed different versions of CUDA and different GPU drivers. We ran GPU Burn and upgraded our BIOS to the latest version. Nothing improved the situation.
We get this error from all GPUs, but only one GPU at a time. It is reproducible: sometimes it takes a few hours, sometimes over 24 hours, but we always hit the error eventually, and we have to restart the workstation to recover.
We installed headless Linux without an X server, but it didn't help either.
From experience, the 2080 Ti's are sensitive to heat. Does this also occur if you're running only two of them with free space in between? Also monitor temperatures using nvidia-smi.
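To log per-GPU temperatures over time, you can poll nvidia-smi and parse its CSV output; here's a minimal sketch (the nvidia-smi query flags are standard, but the parsing helper and sample output are illustrative, not from this thread):

```python
# Poll with:  nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader -l 5
# and feed the CSV lines into a helper like this (hypothetical helper name):

def parse_temps(csv_text):
    """Map GPU index -> temperature (deg C) from nvidia-smi CSV output."""
    temps = {}
    for line in csv_text.strip().splitlines():
        idx, temp = line.split(",")
        temps[int(idx)] = int(temp.strip())
    return temps

# Sample output for a 4-GPU box (made-up numbers for illustration):
sample = "0, 84\n1, 79\n2, 91\n3, 76"
temps = parse_temps(sample)
hottest = max(temps, key=temps.get)
print(f"hottest GPU: {hottest} at {temps[hottest]} C")  # hottest GPU: 2 at 91 C
```

If one card consistently runs much hotter than the others shortly before the Err state appears, that would point toward the thermal explanation.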
We haven't tried this, but we tested with GPU Burn multiple times and couldn't reproduce the issue there. We'll try this too. Thanks.