What to do with GPUs with ECC errors?

Hi all,

I have some GPUs with ECC errors:

  • 1 V100-PCI-32GB
  • 3 V100-SXM2-32GB

What can I do with them?
Is there a tool I can use to understand where the errors are coming from?

For the V100-PCI-32GB (on a server with 8 GPUs of this type):

  • In syslog, while trying to run gpu-burn:
2025-01-17T12:25:53.001011+00:00 pangu kernel: NVRM: GPU at PCI:0000:3e:00: GPU-1a31db83-2ae5-cb92-28f0-9cdd554da5e0
2025-01-17T12:25:53.001036+00:00 pangu kernel: NVRM: GPU Board Serial Number: 1563320002204
2025-01-17T12:25:53.001041+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at physAddr 0x401d671a0 partition 7, subpartition 1.
2025-01-17T12:25:53.024000+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000008
2025-01-17T12:25:53.027994+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000009
2025-01-17T12:25:53.028974+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000a
2025-01-17T12:25:53.029925+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000b
2025-01-17T12:25:53.030920+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000c
2025-01-17T12:25:53.031921+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000d
2025-01-17T12:25:53.031926+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000e
2025-01-17T12:25:53.032921+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000f
2025-01-17T12:25:53.039995+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 63, pid='<unknown>', name=<unknown>, Dynamic Page Retirement: New page retired, reboot to activate (0x0000000000401d67).
2025-01-17T12:25:53.065022+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
  • Even after nvidia-smi --gpu-reset -i 4, and with
nvidia-smi -i 4 -q -d PAGE_RETIREMENT reporting:

==============NVSMI LOG==============

Timestamp                         : Fri Jan 17 13:31:43 2025
Driver Version                    : 565.57.01
CUDA Version                      : 12.7

Attached GPUs                     : 5
GPU 00000000:B1:00.0
    Retired Pages
        Single Bit ECC            : 0
        Double Bit ECC            : 10
        Pending Page Blacklist    : N

I have the same error.
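For completeness, the same retirement counters can also be read programmatically. This is only a minimal sketch, assuming the nvidia-ml-py (pynvml) bindings are installed; GPU index 4 just mirrors the -i 4 above:

# Sketch: read the page-retirement counters via NVML
# (assumes nvidia-ml-py is installed: pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(4)  # same GPU as -i 4 above
    name = pynvml.nvmlDeviceGetName(handle)
    # Pages retired after an uncorrectable (double-bit) error
    dbe_pages = pynvml.nvmlDeviceGetRetiredPages(
        handle, pynvml.NVML_PAGE_RETIREMENT_CAUSE_DOUBLE_BIT_ECC_ERROR)
    # Pages retired after repeated correctable (single-bit) errors
    sbe_pages = pynvml.nvmlDeviceGetRetiredPages(
        handle, pynvml.NVML_PAGE_RETIREMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS)
    # Non-zero means a retirement is still waiting for a reboot to take effect
    pending = pynvml.nvmlDeviceGetRetiredPagesPendingStatus(handle)
    print(f"{name}: {len(dbe_pages)} DBE-retired pages, "
          f"{len(sbe_pages)} SBE-retired pages, pending={pending}")
finally:
    pynvml.nvmlShutdown()

The page addresses returned there should line up with the pages logged by the Xid 63 messages above.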

For the V100-SXM2-32GB:

  • Impossible to run gpu-burn or nvidia-docker. I also tried to flash a recent vBIOS, which failed too:
Adapter not accessible or supported EEPROM not found, skipping

NOTE: Exception caught.

Results:
 Index | Match | Flash | Name 
  <00>                   Tesla V100-SXM2-32GB (10DE,1DB5,10DE,1249) S:00, B:DF
Nothing changed!

ERROR:  Detecting GPU failed.
  • From syslog, I get this kind of error message:
NVRM: Xid (PCI:0000:df:00): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at physAddr 0x7fdf5c2a0 partition 9, subpartition 1.
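In case it helps with triage, a small script like this can summarise the NVRM Xid entries per GPU from syslog. It is only a sketch: the log path /var/log/syslog and the exact message format are assumptions based on the lines above.

# Sketch: count NVRM Xid events per GPU PCI address in syslog.
import re
from collections import Counter

XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:]+)\): (\d+),")

counts = Counter()
with open("/var/log/syslog", errors="replace") as log:
    for line in log:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1

for (pci, xid), n in sorted(counts.items()):
    # e.g. Xid 48 = double-bit ECC error, 63/64 = page retirement event/failure
    print(f"{pci}  Xid {xid:>3}  x{n}")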

More information about the systems:

  • Ubuntu 24.04 server
  • KERNEL_UNAME=6.8.0-51-generic
  • Driver version: 565.57.01

Does it mean the GPUs are physically damaged?

Yes, some memory cells are no longer reliable. As long as the affected pages are retired, the card is safe to reuse.

Details can be found here.
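If you want to double-check after a reboot that the retirements have actually been applied and that no new uncorrectable errors are accumulating, something like this (again only a sketch assuming nvidia-ml-py) can be run across all GPUs before putting the cards back into service:

# Sketch: verify retirements are applied and read the volatile (since last
# driver load) uncorrectable-error counter on every GPU.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        pending = pynvml.nvmlDeviceGetRetiredPagesPendingStatus(handle)
        new_dbe = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC)
        status = "retirement pending, reboot needed" if pending else "retirements applied"
        print(f"GPU {i}: {status}, uncorrectable ECC errors since driver load: {new_dbe}")
finally:
    pynvml.nvmlShutdown()

nvidia-smi -q -d ECC shows the same volatile and aggregate counters if you prefer the CLI.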