P40 - Getting "ECC Double Bit Error"

Dear NVIDIA Support,

Hello everybody, I am reaching out to report a recurring issue with one of my GPUs (ID 1, a Tesla P40) on Ubuntu 20.04, using NVIDIA driver version 510.39.01 and CUDA version 11.6.

The specific problem we keep encountering is an ECC double bit error. The Linux kernel log reports an uncorrectable double bit error (DBE) on the GPU in the framebuffer at partition 1, subpartition 1. Details are listed below:

1 Ubuntu Version 20.04
2 NVIDIA Driver Version: 510.39.01, CUDA Version: 11.6

3 GPU MODEL : P40

4 Description: We have been experiencing ECC double bit errors, consistently on the same device (ID 1).

5 Logs
kernel log (Linux)
Nov 29 14:00:09 vital kernel: [33331410.108566] NVRM: GPU at PCI:0000:5b:00: GPU-5a6fcd8c-a96b-4fe4-3528-63316ed7edb3
Nov 29 14:00:09 vital kernel: [33331410.108571] NVRM: Xid (PCI:0000:5b:00): 48, pid=3965666, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 1, subpartition 1.
Dec 1 11:42:46 vital kernel: [33495956.101014] NVRM: Xid (PCI:0000:5b:00): 48, pid=3877529, Ch 00000008
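To make it easier to track how often these events recur, the Xid lines can be pulled out of the kernel log with a small script. This is only a rough sketch: the regular expression is based solely on the three lines quoted above, and the function name `parse_xid_events` is made up for illustration, so other Xid message layouts may not match.

```python
import re

# Pattern derived from the log lines quoted in this post, e.g.
#   NVRM: Xid (PCI:0000:5b:00): 48, pid=3965666, <message>
# Assumption: other driver versions may format Xid lines differently.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>[^,]+), (?P<message>.*)"
)


def parse_xid_events(lines):
    """Return one dict per Xid line; lines without an Xid are skipped."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match is None:
            continue
        events.append(
            {
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "pid": match.group("pid"),
                "message": match.group("message").strip(),
            }
        )
    return events


# The three kernel log lines from this post.
SAMPLE_LOG = [
    "Nov 29 14:00:09 vital kernel: [33331410.108566] NVRM: GPU at PCI:0000:5b:00: "
    "GPU-5a6fcd8c-a96b-4fe4-3528-63316ed7edb3",
    "Nov 29 14:00:09 vital kernel: [33331410.108571] NVRM: Xid (PCI:0000:5b:00): 48, "
    "pid=3965666, An uncorrectable double bit error (DBE) has been detected on GPU "
    "in the framebuffer at partition 1, subpartition 1.",
    "Dec 1 11:42:46 vital kernel: [33495956.101014] NVRM: Xid (PCI:0000:5b:00): 48, "
    "pid=3877529, Ch 00000008",
]

if __name__ == "__main__":
    for event in parse_xid_events(SAMPLE_LOG):
        print(event["pci"], event["xid"], event["pid"], event["message"])
```

In practice the input lines could come from `dmesg` or `/var/log/kern.log` rather than the hard-coded sample.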

6 DCGM

└──>> dcgmi diag -r 3 -i 1 -v
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.3.1 |
| Driver Version Detected | 510.39.01 |
| GPU Device IDs Detected | 1b38 |
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Info | Persistence mode for GPU 1 is disabled. Enabl |
| | e persistence mode by running "nvidia-smi -i |
| | -pm 1 " as root. |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Fail |
| Error | A pending retired page has been detected in G |
| | PU 1. Monitor - this GPU can still perform wo |
| | rkload |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Integration -------+------------------------------------------------+
| PCIe | Pass - All |
| Info | GPU 1 GPU to Host bandwidth: 13.03 GB/s, GPU |
| | 1 Host to GPU bandwidth: 12.29 GB/s, GPU 1 |
| | bidirectional bandwidth: 22.61 GB/s, GPU 1 GP |
| | U to Host latency: 1.527 us, GPU 1 Host to G |
| | PU latency: 1.559 us, GPU 1 bidirectional la |
| | tency: 2.945 us |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Pass - All |
| Info | GPU 1 Allocated 23637314745 bytes (98.4%) |
| Diagnostic | Pass - All |
| Info | GPU 1 Allocated space for 636 output matricie |
| | s from 21490355404 bytes available., GPU 1 Ru |
| | nning with precisions: FP64 0, FP32 1, FP16 1 |
| | , GPU 1 GPU 1 calculated at approximately 226 |
| | .24 gigaflops during this test |
+----- Stress ------------+------------------------------------------------+
| Targeted Stress | Pass - All |
| Info | GPU 1 GPU 1 relative stress level 3532 |
| Targeted Power | Pass - All |
| Info | GPU 1 GPU 1 max power: 241.8 W average power |
| | usage: 230.9 W |
| Memory Bandwidth | Skip - All |
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+

nvidia-smi output when the error occurs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01 Driver Version: 510.39.01 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:25:00.0 Off | 0 |
| N/A 42C P0 65W / 250W | 1251MiB / 23040MiB | 57% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:5B:00.0 Off | 2 |
| N/A 25C P8 9W / 250W | 2MiB / 23040MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1547355 C java 1249MiB |
+-----------------------------------------------------------------------------+
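The two symptoms visible so far (the "Volatile Uncorr. ECC" count of 2 on GPU 1 and the pending retired page reported by DCGM) can also be polled programmatically. The sketch below wraps `nvidia-smi --query-gpu`; the field names should be checked against `nvidia-smi --help-query-gpu` on the installed driver, and the helper names (`parse_ecc_csv`, `query_ecc_status`, `needs_attention`) are hypothetical. The sample string is not real command output; it is an illustrative line built from the values already reported in this post, and the exact "Yes"/"No" spelling of the pending field is an assumption.

```python
import csv
import subprocess

# Query fields; verify them with `nvidia-smi --help-query-gpu` on your system.
QUERY_FIELDS = [
    "index",
    "name",
    "ecc.errors.uncorrected.volatile.total",
    "retired_pages.pending",
]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for record in csv.reader(text.strip().splitlines(), skipinitialspace=True):
        if not record:
            continue
        values = [value.strip() for value in record]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_ecc_status():
    """Run nvidia-smi and return the parsed rows (needs the NVIDIA driver)."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_ecc_csv(result.stdout)


def needs_attention(row):
    """Flag uncorrected volatile ECC errors or a pending page retirement."""
    count = row["ecc.errors.uncorrected.volatile.total"]
    has_errors = count.isdigit() and int(count) > 0
    pending = row["retired_pages.pending"].lower() == "yes"
    return has_errors or pending


# Illustrative input only, assembled from values reported in this post
# (GPU 1, Tesla P40, 2 volatile uncorrected errors, pending retired page).
SAMPLE_OUTPUT = "1, Tesla P40, 2, Yes\n"

if __name__ == "__main__":
    for row in parse_ecc_csv(SAMPLE_OUTPUT):
        print(row["index"], row["name"], "needs attention:", needs_attention(row))
```

On a machine with the driver installed, `query_ecc_status()` would replace the sample string. `nvidia-smi -q -d ECC,PAGE_RETIREMENT` shows the same information in long form.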

8 Which application failed?
BEAST v1.04.
CUDA error: “Unknown error” (214) from file </usr/local/src/beagle-lib/4.0.0/beagle-lib-4.0.0/libhmsbeagle/GPU/GPUInterfaceCUDA.cpp>, line 495.
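For context, the numeric code in that message seems more informative than the "Unknown error" text. Assuming BEAGLE prints the raw CUDA driver API result code (an assumption, not verified against the BEAGLE source), 214 corresponds to `CUDA_ERROR_ECC_UNCORRECTABLE` in the `CUresult` enum, which would be consistent with the Xid 48 entries in the kernel log. A minimal lookup sketch, covering only a handful of codes and using a made-up helper name `describe_cuda_error`:

```python
# Small subset of the CUDA driver API CUresult enum (values from cuda.h).
# Not a complete table; extend it as needed.
CURESULT_NAMES = {
    0: "CUDA_SUCCESS",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    214: "CUDA_ERROR_ECC_UNCORRECTABLE",
    700: "CUDA_ERROR_ILLEGAL_ADDRESS",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cuda_error(code):
    """Map a numeric CUresult code to its enum name, if it is in the subset."""
    return CURESULT_NAMES.get(code, f"unrecognized code {code}")


if __name__ == "__main__":
    # The code reported by BEAST/BEAGLE in this post.
    print(describe_cuda_error(214))
```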

I appreciate your attention and assistance in resolving this issue. If you need more information or details, I am available to provide the necessary support.

Best regards,


How can this be solved? I have also run into a similar problem here.