Hi, we just bought two machines. One has eight 2080 Ti GPUs installed and the other has ten 3090 GPUs. Both run Ubuntu 16.04.7 with kernel version 4.15.0-142.
The issue often occurs when a training job completes, or when we interrupt a job and try to start a new one. The whole system freezes.
Below is the error we find in the logs. It is reported every time the issue happens:
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1
IP: _nv031699rm+0x79/0x940 [nvidia]
We tried NVIDIA drivers 465.31, 470.63.01, and 470.74 on both machines, but the issue persisted with every version.
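
For anyone who wants to check their own logs for the same signature, here is a minimal sketch of one way to scan a saved kernel log for it. It assumes the log was exported to a text file first (for example with `dmesg > kern.log`). The function name `find_null_deref_oopses` and the file name are hypothetical, and the patterns only cover the two line formats quoted above.

```python
import re
from typing import List, Tuple

# Patterns modeled on the two log lines quoted in this post.
OOPS_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-fA-F]+)"
)
IP_RE = re.compile(r"IP: (\S+) \[(\w+)\]")


def find_null_deref_oopses(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (fault address, symbol+offset, module) for each oops found.

    An oops is only reported if an "IP:" line with a module name
    appears within the three lines that follow the "BUG:" line.
    """
    results = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        oops = OOPS_RE.search(line)
        if not oops:
            continue
        for follow in lines[i + 1 : i + 4]:
            ip = IP_RE.search(follow)
            if ip:
                results.append((oops.group(1), ip.group(1), ip.group(2)))
                break
    return results


if __name__ == "__main__":
    # Hypothetical file name; adjust to wherever the log was saved.
    with open("kern.log", encoding="utf-8", errors="replace") as fh:
        for address, symbol, module in find_null_deref_oopses(fh.read()):
            print(f"NULL deref at {address} in {symbol} [{module}]")
```

Counting the hits per driver version, or checking whether the symbol and offset are identical on both machines, may help narrow down whether the crashes share a single code path.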