Hi,
I am currently running four DGX-1 systems, and end users have reported that one of the GPUs is having an issue.
On checking the server's IPMI log, I see:
218 08/31/2023 01:51:17 Information OEM Record c3 000000 GPU4 - GPU reset is not requested - State Flag: 00
217 08/31/2023 01:51:12 Critical OEM Record c3 000000 GPU4 - GPU reset is requested - State Flag: 01
I would like to know how to evaluate these events and decide whether GPU4 has a hardware problem.
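As a starting point for tracking these events over time, here is a minimal Python sketch that pulls the GPU reset entries out of SEL text and counts reset requests per GPU. It assumes every entry follows the same layout as the two log lines above, and it assumes that State Flag 01 means a reset was requested (which is only inferred from those two lines). The names `parse_sel` and `reset_requests` are hypothetical helpers, not part of any NVIDIA or IPMI tool.

```python
import re

# Assumed layout, taken from the two sample lines:
# <id> <MM/DD/YYYY> <HH:MM:SS> <severity> OEM Record <type> <data> <GPU> - <message> - State Flag: <flag>
SEL_LINE = re.compile(
    r"^(?P<id>\d+)\s+(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<severity>\w+)\s+OEM Record\s+\S+\s+\S+\s+"
    r"(?P<gpu>GPU\d+)\s+-\s+(?P<message>.*?)\s+-\s+State Flag:\s+(?P<flag>[0-9A-Fa-f]+)$"
)


def parse_sel(text):
    """Return one dict per SEL line that matches the assumed layout."""
    events = []
    for line in text.splitlines():
        match = SEL_LINE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def reset_requests(events):
    """Count entries per GPU whose State Flag is 01 (assumed: reset requested)."""
    counts = {}
    for event in events:
        if event["flag"] == "01":
            counts[event["gpu"]] = counts.get(event["gpu"], 0) + 1
    return counts


if __name__ == "__main__":
    sample = (
        "218 08/31/2023 01:51:17 Information OEM Record c3 000000 "
        "GPU4 - GPU reset is not requested - State Flag: 00\n"
        "217 08/31/2023 01:51:12 Critical OEM Record c3 000000 "
        "GPU4 - GPU reset is requested - State Flag: 01\n"
    )
    print(reset_requests(parse_sel(sample)))
```

A count that keeps growing for the same GPU would at least show that the event is recurring rather than a one-off, although it does not by itself confirm a hardware fault.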
FYI: I received the following output from running dcgmi diag:
>>> dcgmi diag -g 0 -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.1.8 |
| Driver Version Detected | 535.104.12 |
| GPU Device IDs Detected | 1db5,1db5,1db5,1db5,1db5,1db5,1db5,1db5 |
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Integration -------+------------------------------------------------+
| PCIe | Pass - All |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Pass - All |
| Diagnostic | Pass - All |
| Pulse Test | Pass - All |
+----- Stress ------------+------------------------------------------------+
| Targeted Stress | Pass - All |
| Targeted Power | Pass - All |
| Memory Bandwidth | Pass - All |
| Memtest | Pass - All |
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+
However, the output did not point to any potential problem.
I would also like to know how to run the EUD test, since it looks like it was skipped.
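When comparing results across several systems, it can help to list only the rows that are not Pass. The following is a rough Python sketch that scans the text table printed by dcgmi diag, assuming the two-column layout shown above and assuming the result cell starts with one of Pass, Fail, Warn, or Skip. The function name `non_pass_results` is hypothetical, and the script only parses saved text; it does not call DCGM itself.

```python
# Result words this sketch recognizes at the start of the Result column (assumption).
KNOWN_STATUSES = ("Pass", "Fail", "Warn", "Skip")


def non_pass_results(diag_output):
    """Return (test name, result) pairs for rows whose status is not Pass."""
    findings = []
    for line in diag_output.splitlines():
        line = line.strip()
        # Keep only data rows; skip borders and section headers.
        if not line.startswith("|") or line.startswith("|-"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        name, result = cells
        status = result.split()[0] if result else ""
        # Metadata rows (versions, device IDs) carry no status word and are ignored.
        if status in KNOWN_STATUSES and status != "Pass":
            findings.append((name, result))
    return findings


if __name__ == "__main__":
    sample = """
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Pass - All |
| Diagnostic | Pass - All |
+----- Stress ------------+------------------------------------------------+
| Memtest | Pass - All |
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+
"""
    for name, result in non_pass_results(sample):
        print(f"{name}: {result}")
```

For the run above, the only row this would report is the skipped EUD Test.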
shan_8992:
| EUD Test | Skip - All
I also tried the following:
dcgmi diag -r eud
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.1.8 |
| Driver Version Detected | 535.104.12 |
| GPU Device IDs Detected | 1db5,1db5,1db5,1db5,1db5,1db5,1db5,1db5 |
|----- Deployment --------+------------------------------------------------|
+----- Integration -------+------------------------------------------------+
+----- Hardware ----------+------------------------------------------------+
+----- Stress ------------+------------------------------------------------+
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+
How can this be solved? I have also run into a similar problem here.