What to do with GPUs with ECC errors?

Hi all,

I have some GPUs with ECC errors:

  • 1 V100-PCI-32GB
  • 3 V100-SXM2-32GB

What can I do with them?
Is there a tool I can use to understand where the errors are coming from?

For the V100-PCI-32GB (on a server with 8 GPUs of this type):

  • In syslog, while trying to run gpu-burn:
2025-01-17T12:25:53.001011+00:00 pangu kernel: NVRM: GPU at PCI:0000:3e:00: GPU-1a31db83-2ae5-cb92-28f0-9cdd554da5e0
2025-01-17T12:25:53.001036+00:00 pangu kernel: NVRM: GPU Board Serial Number: 1563320002204
2025-01-17T12:25:53.001041+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at physAddr 0x401d671a0 partition 7, subpartition 1.
2025-01-17T12:25:53.024000+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000008
2025-01-17T12:25:53.027994+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 00000009
2025-01-17T12:25:53.028974+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000a
2025-01-17T12:25:53.029925+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000b
2025-01-17T12:25:53.030920+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000c
2025-01-17T12:25:53.031921+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000d
2025-01-17T12:25:53.031926+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000e
2025-01-17T12:25:53.032921+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 48, pid='<unknown>', name=<unknown>, Ch 0000000f
2025-01-17T12:25:53.039995+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 63, pid='<unknown>', name=<unknown>, Dynamic Page Retirement: New page retired, reboot to activate (0x0000000000401d67).
2025-01-17T12:25:53.065022+00:00 pangu kernel: NVRM: Xid (PCI:0000:3e:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
  • Even after nvidia-smi --gpu-reset -i 4, and with
nvidia-smi -i 4 -q -d PAGE_RETIREMENT reporting:

==============NVSMI LOG==============

Timestamp                         : Fri Jan 17 13:31:43 2025
Driver Version                    : 565.57.01
CUDA Version                      : 12.7

Attached GPUs                     : 5
GPU 00000000:B1:00.0
    Retired Pages
        Single Bit ECC            : 0
        Double Bit ECC            : 10
        Pending Page Blacklist    : N

I have the same error.
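For completeness, the same retirement counters can also be read programmatically. This is only a minimal sketch, assuming the nvidia-ml-py (pynvml) bindings are installed; GPU index 4 just mirrors the -i 4 above:

# Sketch: read the page-retirement counters via NVML
# (assumes nvidia-ml-py is installed: pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(4)  # same GPU as -i 4 above
    name = pynvml.nvmlDeviceGetName(handle)
    # Pages retired after an uncorrectable (double-bit) error
    dbe_pages = pynvml.nvmlDeviceGetRetiredPages(
        handle, pynvml.NVML_PAGE_RETIREMENT_CAUSE_DOUBLE_BIT_ECC_ERROR)
    # Pages retired after repeated correctable (single-bit) errors
    sbe_pages = pynvml.nvmlDeviceGetRetiredPages(
        handle, pynvml.NVML_PAGE_RETIREMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS)
    # Non-zero means a retirement is still waiting for a reboot to take effect
    pending = pynvml.nvmlDeviceGetRetiredPagesPendingStatus(handle)
    print(f"{name}: {len(dbe_pages)} DBE-retired pages, "
          f"{len(sbe_pages)} SBE-retired pages, pending={pending}")
finally:
    pynvml.nvmlShutdown()

The page addresses returned there should line up with the pages logged by the Xid 63 messages above.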

For the V100-SXM2-32GB:

  • Impossible to run gpu-burn or nvidia-docker. I also tried to flash a recent vBIOS, which failed too:
Adapter not accessible or supported EEPROM not found, skipping

NOTE: Exception caught.

Results:
 Index | Match | Flash | Name 
  <00>                   Tesla V100-SXM2-32GB (10DE,1DB5,10DE,1249) S:00, B:DF
Nothing changed!

ERROR:  Detecting GPU failed.
  • From syslog, I get this kind of error message:
NVRM: Xid (PCI:0000:df:00): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at physAddr 0x7fdf5c2a0 partition 9, subpartition 1.
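In case it helps with triage, a small script like this can summarise the NVRM Xid entries per GPU from syslog. It is only a sketch: the log path /var/log/syslog and the exact message format are assumptions based on the lines above.

# Sketch: count NVRM Xid events per GPU PCI address in syslog.
import re
from collections import Counter

XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:]+)\): (\d+),")

counts = Counter()
with open("/var/log/syslog", errors="replace") as log:
    for line in log:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1

for (pci, xid), n in sorted(counts.items()):
    # e.g. Xid 48 = double-bit ECC error, 63/64 = page retirement event/failure
    print(f"{pci}  Xid {xid:>3}  x{n}")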

More information about the systems:

  • Ubuntu 24.04 server
  • KERNEL_UNAME=6.8.0-51-generic
  • Driver version: 565.57.01

Does it mean the GPUs are physically damaged?

Yes, some memory cells are no longer reliable. As long as the affected pages are retired, the card is safe to reuse.

Details can be found here.
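If you want to double-check after a reboot that the retirements have actually been applied and that no new uncorrectable errors are accumulating, something like this (again only a sketch assuming nvidia-ml-py) can be run across all GPUs before putting the cards back into service:

# Sketch: verify retirements are applied and read the volatile (since last
# driver load) uncorrectable-error counter on every GPU.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        pending = pynvml.nvmlDeviceGetRetiredPagesPendingStatus(handle)
        new_dbe = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC)
        status = "retirement pending, reboot needed" if pending else "retirements applied"
        print(f"GPU {i}: {status}, uncorrectable ECC errors since driver load: {new_dbe}")
finally:
    pynvml.nvmlShutdown()

nvidia-smi -q -d ECC shows the same volatile and aggregate counters if you prefer the CLI.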