An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, ...

MarkusD · October 24, 2016, 9:38am

Hello.
We use a K80 graphics card with the OpenCL 1.1 library. The server is a ProLiant 380 G9 running Red Hat 6.5. When a colleague kills a thread with kill -9 on the server, the graphics card raises this error and the server stops. The server was then unreachable, and we had to restart it to solve the problem. In the log we found this:

Oct 17 14:22:26 node0 kernel: NVRM: GPU at PCI:0000:86:00: GPU-4214893c-f01f-7bf8-0583-a45f2746bbb4
Oct 17 14:22:26 node0 kernel: NVRM: GPU Board Serial Number: 0325214065534
Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, subpartition 0
Oct 17 14:22:26 node0 kernel:
Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 62, 13e9(2468) 00000000 00000000

Does anyone have an idea what happened here?

Best Regards
Markus

njuffa · October 24, 2016, 5:00pm

I will assume this is a server with vendor-integrated K80, not some jury-rigged home-brew configuration, and that the server is not being operated in some harsh environment (e.g. near a powerful source of EMI or RFI).

The K80 implements ECC with SECDED (single error correct, double error detect). That means it can correct single-bit errors (e.g. a cosmic ray flipping one bit), and each correction is logged. nvidia-smi can show you the current count of such corrections. Double-bit errors cannot be corrected because the code does not carry enough information, but they can be detected. So as not to silently continue operation with incorrect data, the GPU is halted. In that respect it works much the same as a server with ECC-protected system memory.
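To make the correct-one, detect-two behavior concrete, here is a toy sketch using an extended Hamming(8,4) code in Python. This is only an illustration of the SECDED principle; it is an assumption that this small code is a fair stand-in, and it is not the K80's actual ECC layout, which protects much wider memory words. The function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = data
    # Parity bits sit at positions 1, 2 and 4; each covers the positions
    # whose index has that bit set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # The extra overall parity bit is what upgrades the code to SECDED.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # The position is ambiguous, so the only safe action is to stop.
        return ("uncorrectable", None)

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return (status, nibble)
```

Flipping any one bit of `encode(n)` decodes back to `n` with status `"corrected"`, while flipping any two bits yields `"uncorrectable"`, which corresponds to the situation where the GPU halts rather than continue with bad data.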

I do not see how a double-bit error could be caused by the killing of the thread, but there may be indirect linkage as follows: killing the thread triggered a tear-down of a GPU context connected with that thread, during which the corrupted data was accessed, triggering the double-bit event on the GPU.

I think what you would want to do is look closely at the ECC error statistics for this card: are there many single-bit errors recorded in addition to the one double-bit error event, and do the counts continue to increase under further usage? If so, I would contact the system vendor, as this could indicate a problem with the memory on the GPU and you may need to replace the GPU. All electronic devices physically age due to multiple failure mechanisms and this can lead to device failure. The aging process is accelerated by operating at elevated temperatures.
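For a quick look at the counters, `nvidia-smi -q -d ECC` prints the ECC section for each GPU. If you want to record the counts over time, a small Python sketch along these lines may help. It assumes the `ecc.errors.*` query fields listed by `nvidia-smi --help-query-gpu` are available with your driver version; the helper names `parse_ecc_csv` and `query_ecc` are made up for this example.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
# Assumption: your driver version supports all of them.
FIELDS = [
    "index",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.corrected.aggregate.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def _to_int(value):
    """Convert a counter to int; non-numeric values such as N/A become None."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        values = [v.strip() for v in record]
        if len(values) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append({name: _to_int(v) for name, v in zip(FIELDS, values)})
    return rows


def query_ecc():
    """Run nvidia-smi and return the parsed ECC counters for every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ecc_csv(result.stdout)
```

Calling `query_ecc()` periodically (for example from cron) and comparing successive results shows whether the corrected or uncorrected counts keep growing. The volatile counters cover the time since the driver was last loaded, while the aggregate counters persist across reboots.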
