One of our lab’s servers, equipped with 8 RTX 3090 GPUs, has recently been hitting the following issue frequently.
The Linux system is Ubuntu 18.04, and the driver version is 470.182.03.
When I run nvidia-smi, I receive the error: Unable to determine the device handle for GPU 0000:81:00.0: Unknown Error
Restarting the machine resolves the issue temporarily, but it recurs after a short period.
Using nvidia-debugdump --list, I get the following output:
Found 8 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-cdee12ce-7104-5850-8d12-69f932bd655e
Device ID: 1
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-39e17c11-e874-3dcb-d4d0-df16a96c30f9
Device ID: 2
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-37b864a4-5b52-d9cf-894b-c8062a3fb2b7
Device ID: 3
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-179804e0-ac39-0fab-04d1-4edc2da8fa45
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x4): Unknown Error
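As a rough aid for reading the output above, here is a minimal Python sketch that parses text in this shape and reports which devices enumerated and where enumeration stopped. It assumes the line format is exactly as pasted in this post, and it assumes the hex value in "FAILED to get details on GPU (0x4)" is the device index, which is an inference from the output and not something confirmed by NVIDIA documentation. The function names parse_debugdump_list and failed_index are made up for this illustration.

```python
import re

def parse_debugdump_list(text):
    """Split `nvidia-debugdump --list` style text into devices and error lines.

    Assumes the "Device ID / Device name / GPU internal ID" layout shown
    in this post; other driver versions may print something different.
    """
    devices = []
    errors = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Device ID:"):
            current = {"id": int(line.split(":", 1)[1])}
            devices.append(current)
        elif line.startswith("Device name:") and current is not None:
            current["name"] = line.split(":", 1)[1].strip()
        elif line.startswith("GPU internal ID:") and current is not None:
            current["uuid"] = line.split(":", 1)[1].strip()
        elif line.startswith("Error:") or line.startswith("FAILED"):
            errors.append(line)
    return devices, errors

def failed_index(errors):
    """Return the index from a 'FAILED ... (0xN)' line, or None.

    Treating the hex value as a device index is an assumption.
    """
    for line in errors:
        match = re.search(r"\(0x([0-9a-fA-F]+)\)", line)
        if match:
            return int(match.group(1), 16)
    return None

# Abbreviated copy of the output from this post (devices 0 and 3 only).
SAMPLE = """Found 8 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-cdee12ce-7104-5850-8d12-69f932bd655e
Device ID: 3
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-179804e0-ac39-0fab-04d1-4edc2da8fa45
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x4): Unknown Error
"""

if __name__ == "__main__":
    devices, errors = parse_debugdump_list(SAMPLE)
    print("Enumerated device IDs:", [d["id"] for d in devices])
    print("Enumeration failed at index:", failed_index(errors))
```

In practice the text would come from saving the command output to a file and passing its contents to parse_debugdump_list. Under the assumption above, the full output in this post would mean devices 0 through 3 responded and the failure occurred at index 4.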
The log file is as follows:
nvidia-bug-report.log (5.2 MB)
Requesting assistance to resolve this issue.