"BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1" error occurs on Ubuntu 16.04 with NVIDIA drivers 465.31, 470.63.01, and 470.74

Hi, we just bought two machines. One has eight 2080 Ti GPUs installed and the other has ten 3090 GPUs. Both run Ubuntu 16.04.7 with kernel version 4.15.0-142.
The issue often occurs when a training job completes, or when we interrupt one job and try to start a new one. The system simply freezes.

Below is the error we find in the logs. It was reported every time the issue happened.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1

IP: _nv031699rm+0x79/0x940 [nvidia]

We tried NVIDIA drivers 465.31, 470.63.01, and 470.74 on both machines, but none of them resolved the issue.
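
For anyone who wants to check their own logs for the same signature, here is a minimal Python sketch that pairs each "BUG: ... NULL pointer dereference" line with the "IP:" line that follows it. The two regular expressions are assumptions based only on the two log lines quoted above, and the function name `find_null_derefs` is made up for this example; other kernel versions may format these lines differently.

```python
import re

# Assumption: both patterns are modeled on the two log lines quoted in this post.
BUG_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-fA-F]+)"
)
IP_RE = re.compile(r"IP: (\S+)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+ \[(\w+)\]")


def find_null_derefs(log_text):
    """Return (address, symbol, module) for each BUG line in log_text.

    symbol and module are None when no IP line appears before the next
    BUG line or the end of the text.
    """
    hits = []
    pending = None  # faulting address of a BUG line still waiting for its IP line
    for line in log_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            if pending is not None:
                hits.append((pending, None, None))
            pending = bug.group(1)
            continue
        ip = IP_RE.search(line)
        if ip and pending is not None:
            hits.append((pending, ip.group(1), ip.group(2)))
            pending = None
    if pending is not None:
        hits.append((pending, None, None))
    return hits


sample = """\
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b1

IP: _nv031699rm+0x79/0x940 [nvidia]
"""
print(find_null_derefs(sample))
```

To run it against a real machine, save the output of `dmesg` (or the contents of /var/log/kern.log) to a file and pass the file's text to `find_null_derefs` instead of `sample`.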

I have the same issue: the kernel always panics when we train for a long time (one or two days). We use LXD to isolate each GPU and user.
Our system information is as follows:
Linux b605gpu 4.15.0-162-generic #170-Ubuntu SMP Mon Oct 18 11:38:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0
NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4