Hello,
During GPU training, the GPU frequently freezes. We reinstalled the operating system and the GPU drivers, but the GPU still drops out occasionally. We have ruled out temperature issues but cannot identify the root cause, so we are attaching the bug report and the information we have collected. We appreciate your time and assistance in diagnosing the GPU drop issue.
We also noticed that when this issue occurs, we cannot reboot remotely because the GPU never unloads. This is particularly inconvenient when working remotely, since recovery requires a physical shutdown.
Additional Information:
The output from nvidia-smi shows:
| 6 NVIDIA GeForce RTX 3090 Off | 00000000:23:00.0 N/A | N/A |
|ERR! ERR! ERR! N/A / N/A | 23823MiB / 24576MiB | N/A Default |
| | | ERR! |
The log messages indicate:
name=python, Timeout after 45s of waiting for RPC response from GPU6 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801702 0x4).
name=python, Timeout after 45s of waiting for RPC response from GPU6 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a6a 0x0).
U at PCI:0000:23:00 (printing 1 of every 30). The GPU likely needs to be reset.
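In case it helps anyone triaging similar logs, here is a minimal Python sketch that pulls the GPU index, RPC function, and arguments out of timeout lines like the ones quoted above. The regular expression is an assumption based only on those two sample lines, and the names parse_gsp_timeouts and count_by_gpu are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line layout, inferred only from the two sample lines in this post.
GSP_TIMEOUT = re.compile(
    r"Timeout after (?P<secs>\d+)s of waiting for RPC response from "
    r"GPU(?P<gpu>\d+) GSP! Expected function (?P<func>\d+) "
    r"\((?P<name>\w+)\) \((?P<arg0>0x[0-9a-fA-F]+) (?P<arg1>0x[0-9a-fA-F]+)\)"
)


def parse_gsp_timeouts(lines):
    """Return one dict per line that matches the GSP RPC timeout pattern."""
    events = []
    for line in lines:
        match = GSP_TIMEOUT.search(line)
        if match is None:
            continue  # Not a GSP timeout line; skip it.
        events.append({
            "gpu": int(match["gpu"]),
            "timeout_s": int(match["secs"]),
            "function": int(match["func"]),
            "name": match["name"],
            "args": (match["arg0"], match["arg1"]),
        })
    return events


def count_by_gpu(events):
    """Count timeout events per GPU index."""
    return Counter(event["gpu"] for event in events)


# The two log lines from this post, used as sample input.
SAMPLE = [
    "name=python, Timeout after 45s of waiting for RPC response from GPU6 GSP! "
    "Expected function 76 (GSP_RM_CONTROL) (0x20801702 0x4).",
    "name=python, Timeout after 45s of waiting for RPC response from GPU6 GSP! "
    "Expected function 76 (GSP_RM_CONTROL) (0x20800a6a 0x0).",
]

if __name__ == "__main__":
    for event in parse_gsp_timeouts(SAMPLE):
        print(event)
    print(count_by_gpu(parse_gsp_timeouts(SAMPLE)))
```

Feeding it the lines of a saved kernel log instead of SAMPLE should show whether the timeouts always hit the same GPU index, which may help separate a single faulty card or slot from a system-wide problem.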
However, when I attempted to reset the GPU using $ sudo nvidia-smi --gpu-reset -i 6, it returned:
The following GPUs could not be reset:
GPU 00000000:23:00.0: Not Supported
Driver Version: 570.86.15
GPU: NVIDIA GeForce RTX 3090
nvidia-bug-report.log.gz (10.9 MB)