GPU reset event displayed in IPMI event log

Hi,
I am currently running four DGX-1 systems, and end users have reported that one of the GPUs is having an issue.

On checking the server's IPMI event log, I see the following entries:

218	08/31/2023 01:51:17	Information	OEM Record c3	000000	GPU4 - GPU reset is not requested - State Flag: 00
217	08/31/2023 01:51:12	Critical	OEM Record c3	000000	GPU4 - GPU reset is requested - State Flag: 01
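
In case it helps with triage, here is a minimal sketch of how entries like the two above could be tallied per GPU when the log is exported as text. The field layout (ID, timestamp, severity, record type, code, message) is assumed from these two lines only, and the names `parse_gpu_events` and `critical_counts` are hypothetical helpers, not part of any NVIDIA or IPMI tool.

```python
import re
from collections import defaultdict

# Layout assumed from the two log lines quoted in this post:
# id, timestamp, severity, record type, code, "GPUn - message - State Flag: xx".
EVENT_RE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<severity>\w+)\s+.*?"
    r"(?P<gpu>GPU\d+) - (?P<msg>.+?) - State Flag: (?P<flag>[0-9A-Fa-f]+)\s*$"
)


def parse_gpu_events(log_text):
    """Return one dict per log line that matches the assumed layout."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def critical_counts(events):
    """Count Critical-severity events per GPU label."""
    counts = defaultdict(int)
    for event in events:
        if event["severity"] == "Critical":
            counts[event["gpu"]] += 1
    return dict(counts)


# The two entries from this post, tab-separated as in the IPMI log view.
SAMPLE = (
    "218\t08/31/2023 01:51:17\tInformation\tOEM Record c3\t000000\t"
    "GPU4 - GPU reset is not requested - State Flag: 00\n"
    "217\t08/31/2023 01:51:12\tCritical\tOEM Record c3\t000000\t"
    "GPU4 - GPU reset is requested - State Flag: 01\n"
)

if __name__ == "__main__":
    print(critical_counts(parse_gpu_events(SAMPLE)))
```

A GPU that keeps accumulating Critical entries over time would be the one to look at more closely; a single request/clear pair like the one above does not say much on its own.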

I would like to know how to evaluate whether GPU4 has a hardware problem.


FYI: I received the following output from running dcgmi diag:

>>> dcgmi diag -g 0 -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 535.104.12                                     |
| GPU Device IDs Detected   | 1db5,1db5,1db5,1db5,1db5,1db5,1db5,1db5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
| Pulse Test                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - All                                     |
| Memory Bandwidth          | Pass - All                                     |
| Memtest                   | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+
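
When comparing this output across several systems, a short script can pull out only the rows that did not pass. The following is a rough sketch that assumes the plain-text table layout shown above (rows of the form `| name | result |`); `parse_diag_table` and `not_passing` are hypothetical helper names, and the sample text is an abbreviated copy of the table in this post.

```python
def parse_diag_table(text):
    """Return {test name: result} for rows shaped like '| name | result |'."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        # Borders and section headers start with '+' or '|-', so skip them.
        if not line.startswith("| "):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells == ["Diagnostic", "Result"]:
            continue  # malformed row or the column header
        results[cells[0]] = cells[1]
    return results


def not_passing(results):
    """Keep only rows whose result starts with Skip, Fail, or Warn."""
    flagged = ("Skip", "Fail", "Warn")
    return {name: res for name, res in results.items() if res.startswith(flagged)}


# Abbreviated copy of the table above; widths do not matter to the parser.
SAMPLE = """\
+---------------------------+------------------+
| Diagnostic                | Result           |
+===========================+==================+
|-----  Metadata  ----------+------------------|
| DCGM Version              | 3.1.8            |
+-----  Hardware  ----------+------------------+
| GPU Memory                | Pass - All       |
| Diagnostic                | Pass - All       |
+-----  Stress  ------------+------------------+
| Memtest                   | Pass - All       |
| EUD Test                  | Skip - All       |
+---------------------------+------------------+
"""

if __name__ == "__main__":
    print(not_passing(parse_diag_table(SAMPLE)))
```

Metadata rows such as the DCGM version are parsed as well but are never flagged, because their values do not start with Skip, Fail, or Warn.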

However, the diagnostic did not identify any potential problem.

I would also like to know how to run the EUD test, since it appears to have been skipped.

I also tried the following:

dcgmi diag -r eud
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 535.104.12                                     |
| GPU Device IDs Detected   | 1db5,1db5,1db5,1db5,1db5,1db5,1db5,1db5        |
|-----  Deployment  --------+------------------------------------------------|
+-----  Integration  -------+------------------------------------------------+
+-----  Hardware  ----------+------------------------------------------------+
+-----  Stress  ------------+------------------------------------------------+
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+

How can this be solved? I have also run into a similar problem here.