Hi all,
I have several GPUs with ECC errors:
- 1 V100-PCI-32GB
- 3 V100-SXM2-32GB
What can I do with them?
Are there any tools to find out where the errors are coming from?
For the V100-PCI-32GB (on a server with 8 GPUs like this):
- In syslog, while trying to run gpu-burn:
2025-01-17T12:25:53.001011+00:00 pangu kernel: NVRM: GPU at PCI:0000:3e:00: GPU-1a31db83-2ae5-cb92-28f0-9cdd554da5e0
2025-01-17T12:25:53.001036+00:00 pangu kernel: NVRM: GPU Board Serial Number: 1563320002204
2025-01-17T12:25:53.001041+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at physAddr 0x401d671a0 partition 7, subpartition 1.
2025-01-17T12:25:53.024000+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000008
2025-01-17T12:25:53.027994+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000009
2025-01-17T12:25:53.028974+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000a
2025-01-17T12:25:53.029925+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000b
2025-01-17T12:25:53.030920+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000c
2025-01-17T12:25:53.031921+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000d
2025-01-17T12:25:53.031926+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000e
2025-01-17T12:25:53.032921+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000f
2025-01-17T12:25:53.039995+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 63, pid='<unknown>', name=<unknown>, Dynamic Page Retirement: New page retired, reboot to activate (0x0000000000401d67).
2025-01-17T12:25:53.065022+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
- Even after nvidia-smi --gpu-reset -i 4, or after checking with nvidia-smi -i 4 -q -d PAGE_RETIREMENT:
==============NVSMI LOG==============
Timestamp : Fri Jan 17 13:31:43 2025
Driver Version : 565.57.01
CUDA Version : 12.7
Attached GPUs : 5
GPU 00000000:B1:00.0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 10
Pending Page Blacklist : N
I have the same error.
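To see at a glance which Xid codes each GPU reports, the syslog lines above can be tallied with a small script. This is only a rough sketch: the regular expression is modelled on the "NVRM: Xid (PCI:...): <code>," prefix seen in my log, and the function name count_xids is something I made up for illustration.

```python
import re
from collections import Counter

# Matches the "NVRM: Xid (PCI:<address>): <code>," prefix seen in the syslog excerpt.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(lines):
    """Count Xid events per (PCI address, Xid code) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Shortened copies of lines from the excerpt above.
sample = [
    "pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000008",
    "pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000009",
    "pangu kernel: NVRM: Xid (PCI:0000:3e:00): 63, pid='<unknown>', name=<unknown>, Dynamic Page Retirement: New page retired, reboot to activate (0x0000000000401d67).",
    "pangu kernel: NVRM: GPU Board Serial Number: 1563320002204",
]

for (pci, xid), n in sorted(count_xids(sample).items()):
    print(f"PCI {pci}: Xid {xid} seen {n} time(s)")
```

On a real system the same function can be fed an open log file, for example open("/var/log/syslog"), assuming that is where the kernel messages end up.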
For the V100-SXM2-32GB:
- I cannot run gpu-burn or nvidia-docker. I tried to flash a recent vBIOS, but that fails as well:
Adapter not accessible or supported EEPROM not found, skipping
NOTE: Exception caught.
Results:
Index | Match | Flash | Name
<00> Tesla V100-SXM2-32GB (10DE,1DB5,10DE,1249) S:00, B:DF
Nothing changed!
ERROR: Detecting GPU failed.
- In syslog I get error messages like this one:
NVRM: Xid (PCI:0000:df:00): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at physAddr 0x7fdf5c2a0 partition 9, subpartition 1.
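To compare the failing locations across cards, the DBE messages can be broken into their fields. The sketch below is only an illustration: parse_dbe is a hypothetical helper, and the 12-bit shift used to derive a page number is an assumption inferred from the PCI card's log, where physAddr 0x401d671a0 is followed by a retired page 0x401d67. I have not confirmed it against NVIDIA documentation.

```python
import re

# Modelled on the Xid 48 message format seen in the syslog excerpts.
DBE_RE = re.compile(
    r"Xid \(PCI:([0-9a-fA-F:.]+)\): 48,.*"
    r"physAddr (0x[0-9a-fA-F]+) partition (\d+), subpartition (\d+)"
)


def parse_dbe(line):
    """Return the fields of an Xid 48 DBE message, or None (hypothetical helper)."""
    match = DBE_RE.search(line)
    if match is None:
        return None
    phys_addr = int(match.group(2), 16)
    return {
        "pci": match.group(1),
        "phys_addr": phys_addr,
        # Assumption: 4 KiB pages, inferred only from the log shown earlier.
        "page": phys_addr >> 12,
        "partition": int(match.group(3)),
        "subpartition": int(match.group(4)),
    }


line = (
    "NVRM: Xid (PCI:0000:df:00): 48, pid='<unknown>', name=<unknown>, "
    "An uncorrectable double bit error (DBE) has been detected on GPU in the "
    "framebuffer at physAddr 0x7fdf5c2a0 partition 9, subpartition 1."
)
info = parse_dbe(line)
print(info["pci"], hex(info["page"]), info["partition"], info["subpartition"])
```

If the same page, partition and subpartition keep showing up for a given card, that would at least suggest a fixed location in memory rather than random errors.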
More information about the systems:
- Ubuntu 24.04 server
- KERNEL_UNAME=6.8.0-51-generic
- Driver version : 565.57.01
Does this mean the GPUs are physically damaged?