I’ve already installed and verified GDS successfully, but when running my CUDA code it now always fails with cudaErrorECCUncorrectable (214), an error that never showed up before. I wonder if it’s a mismatch between the GPU and the NVIDIA driver or the CUDA driver, because when running the same code on a 3080 Ti with CUDA 11.7 the error never appeared. If that’s not the reason, how can I solve it?
(CUDA: 11.7, NVIDIA driver: 515.43.04, NIC: Mellanox ConnectX-5, MLNX_OFED: MLNX_OFED_LINUX-5.3-1.0.0.1-ubuntu18.04-x86_64, kernel (uname -r): 5.4.0-70-generic, SSD: Optane NVMe SSD, GPU: Tesla P100)
P.S. 1. The ECC error doesn’t disappear after a cold boot; the code runs for a little while, but the same error soon recurs. 2. The nvidia-smi output looks fine (as follows).
nvidia-bug-report.log.gz (653.5 KB)
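The error surfaces through ordinary status checks in my code; here is a simplified sketch of the checking pattern (illustrative only, not my actual code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal status check: once the driver has flagged an uncorrectable ECC
// error, essentially any runtime API call or kernel launch on that GPU
// returns cudaErrorECCUncorrectable (214).
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s:%d: %s (%d)\n", __FILE__, __LINE__, \
                         cudaGetErrorString(err_), (int)err_);           \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

int main() {
    void* buf = nullptr;
    CUDA_CHECK(cudaMalloc(&buf, 1 << 20)); // the kind of call that fails with 214
    CUDA_CHECK(cudaFree(buf));
    return 0;
}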
The RTX 3080 Ti is a consumer GPU that does not support ECC, so obviously no ECC errors will be reported there.
The nvidia-smi overview you included cannot tell the whole story. After your app fails with an ECC error, run nvidia-smi -q on the Tesla P100 and then copy the following section of its output here:
Ecc Mode
    Current :
    Pending :
ECC Errors
    Volatile
        SRAM Correctable :
        SRAM Uncorrectable :
        DRAM Correctable :
        DRAM Uncorrectable :
    Aggregate
        SRAM Correctable :
        SRAM Uncorrectable :
        DRAM Correctable :
        DRAM Uncorrectable :
Retired Pages
    Single Bit ECC :
    Double Bit ECC :
    Pending Page Blacklist :
Remapped Rows :
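(To capture just that section, nvidia-smi -q -d ECC limits the query output to the ECC fields; add -i <index> if more than one GPU is installed.)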
Uncorrectable ECC errors must be cleared manually by explicit user action using nvidia-smi. CUDA will refuse to establish a context on the GPU until that happens.
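For reference, nvidia-smi -p 0 resets the volatile ECC error counts and nvidia-smi -p 1 the aggregate counts; both require root privileges.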
I ran nvidia-smi -q and the results are as follows.
It doesn’t look to me like ECC is enabled on that P100.
I don’t have any suggestions about why that error is occurring; without a better understanding of the test case, I can’t explain it.
One thing to check is to make sure the P100 is not overheating.
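nvidia-smi -q -d TEMPERATURE reports the current GPU temperature together with the slowdown and shutdown thresholds, and nvidia-smi dmon prints a periodic one-line summary you can watch while the test runs.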
I have never seen CUDA report an ECC error for a GPU that has ECC disabled per the nvidia-smi output shown here. I guess it is possible that the CUDA runtime could get corrupted by out-of-bounds writes from a CUDA application linked to it, leading to bogus CUDA status returns. In that case valgrind may help find such writes.
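For example, valgrind --tool=memcheck ./your_app, with your_app standing in for your test binary. The CUDA driver itself typically triggers a number of benign valgrind warnings, so focus on invalid writes attributed to your own code.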