Nvidia-smi error after a few minutes of uptime

I have been noticing this error after my machine has been up for a few minutes.

sudo nvidia-smi
Unable to determine the device handle for GPU0000:08:00.0: Unknown Error

This is after Plex was using the driver just fine a few minutes prior:

joshu@server:~$ sudo nvidia-smi
Mon Jan 23 11:46:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:08:00.0 Off |                    0 |
| N/A   87C    P0    30W /  75W |    367MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|=============================================================================|
|    0   N/A  N/A      6354      C   ...diaserver/Plex Transcoder      167MiB  |
|    0   N/A  N/A      7214      C   ...diaserver/Plex Transcoder      197MiB  |
+-----------------------------------------------------------------------------+

Here is a bug report. Note that I think the GPU is falling off the bus.

nvidia-bug-report.log.gz (152.5 KB)
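When it is in this state I can at least check whether the card still shows up on the PCIe bus at all. A rough check (assuming the slot address stays at 0000:08:00.0) would be:

sudo lspci | grep -i nvidia
sudo lspci -s 08:00.0 -vv | grep -i LnkSta

If the device is missing from lspci entirely, that would back up the "fallen off the bus" theory rather than a driver problem.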

dmesg is showing this:

[ 505.050774] NVRM: GPU at PCI:0000:08:00: GPU-06c5ac54-038b-1629-7e67-9c7f9a55c914
[ 505.050787] NVRM: GPU Board Serial Number: 0422818015279
[ 505.050791] NVRM: Xid (PCI:0000:08:00): 79, pid='', name=, GPU has fallen off the bus.
[ 505.050796] NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
[ 505.050799] NVRM: GPU 0000:08:00.0: GPU serial number is 0422818015279.
[ 505.050814] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
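To catch the exact moment it drops, I have started leaving a kernel log follow running in another terminal. Nothing fancy, just watching for the NVRM/Xid lines:

sudo dmesg --follow | grep -i --line-buffered 'nvrm\|xid'

(or sudo journalctl -k -f piped into the same grep, which should be equivalent on a systemd box)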

Then when I run nvidia-smi (here trying to re-enable persistence mode with -pm 1) I get this:

joshu@server:~$ sudo nvidia-smi -pm 1
Unable to determine the device handle for GPU0000:08:00.0: Unknown Error
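For reference, the -pm 1 above is just me trying to turn persistence mode back on by hand after the error. Normally I keep it enabled at boot through the persistence daemon that ships with the driver, roughly like this (assuming the packaged unit on my distro is called nvidia-persistenced):

sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced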

I wonder if it's a cooling issue (per this post). I'm going to start logging temperatures to check; see the loop at the end of this post.

Here are the last couple of temps it reported before it 'fell off the bus':

joshu@server:~$ sudo nvidia-smi
Mon Jan 23 11:46:39 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:08:00.0 Off |                    0 |
| N/A   87C    P0    30W /  75W |    367MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|=============================================================================|
|    0   N/A  N/A      6354      C   ...diaserver/Plex Transcoder      167MiB  |
|    0   N/A  N/A      7214      C   ...diaserver/Plex Transcoder      197MiB  |
+-----------------------------------------------------------------------------+
joshu@server:~$ sudo nvidia-smi
Mon Jan 23 11:46:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:08:00.0 Off |                    0 |
| N/A   87C    P0    30W /  75W |    367MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|=============================================================================|
|    0   N/A  N/A      6354      C   ...diaserver/Plex Transcoder      167MiB  |
|    0   N/A  N/A      7214      C   ...diaserver/Plex Transcoder      197MiB  |
+-----------------------------------------------------------------------------+
joshu@server:~$ sudo nvidia-smi
Unable to determine the device handle for GPU0000:08:00.0: Unknown Error
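To rule cooling in or out (87C on a passively cooled P4 seems high to me), next time it comes back up I'm going to log the temperature, power, and utilization once a second until it falls off again. A simple loop like this, using nvidia-smi's query options, should be enough (the log path is just whatever I pick):

while true; do
  nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu \
             --format=csv,noheader >> /tmp/p4-temps.csv
  sleep 1
done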