ECC error occurs when running CUDA code on P100

I’ve already installed and verified GDS successfully. But when I run my CUDA code, it always fails with the error 'cudaErrorECCUncorrectable = 214', which never showed up before. I wonder whether this is caused by a mismatch between the GPU and the NVIDIA driver or the CUDA driver, because the error never appeared when I ran the code on a 3080 Ti with CUDA 11.7. If that is not the reason, how can I solve it?

CUDA: 11.7, NVIDIA driver: 515.43.04, Mellanox: ConnectX5, MLNX_OFED: MLNX_OFED_LINUX-5.3-1.0.0.1-ubuntu18.04-x86_64, kernel (uname -r): 5.4.0-70-generic, SSD: Optane NVMe SSD, GPU: P100
P.S.
1. The ECC error doesn’t disappear after a cold boot. The code runs for a little while, but the same error soon occurs again.
2. The nvidia-smi output looks fine (as follows).
[image: nvidia-smi output]
nvidia-bug-report.log.gz (653.5 KB)

The RTX 3080Ti is a consumer GPU that does not support ECC, so obviously no ECC errors will be reported there.

The nvidia-smi overview you included cannot tell the whole story. After your app fails with an ECC error, run nvidia-smi -q on the Tesla P100 and then copy the following section of its output here:

    Ecc Mode
        Current                           : 
        Pending                           : 
    ECC Errors
        Volatile
            SRAM Correctable              : 
            SRAM Uncorrectable            : 
            DRAM Correctable              : 
            DRAM Uncorrectable            : 
        Aggregate
            SRAM Correctable              : 
            SRAM Uncorrectable            : 
            DRAM Correctable              : 
            DRAM Uncorrectable            : 
    Retired Pages
        Single Bit ECC                    : 
        Double Bit ECC                    : 
        Pending Page Blacklist            : 
    Remapped Rows                         : 
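
As a rough sketch of one way to pull out just that section: the script below runs nvidia-smi -q and keeps only the requested per-GPU blocks. The helper name extract_sections is hypothetical, and the parsing assumes the four-space indentation seen in the excerpt above, which may differ between driver versions.

```python
import shutil
import subprocess

# Section headers taken from the excerpt requested in this thread.
WANTED = ("Ecc Mode", "ECC Errors", "Retired Pages", "Remapped Rows")


def extract_sections(text, wanted=WANTED, indent=4):
    """Return the lines of the requested per-GPU sections of `nvidia-smi -q` output.

    Assumption: section headers sit at exactly `indent` spaces and their
    fields are indented further, as in the excerpt quoted above.
    """
    kept = []
    capturing = False
    for line in text.splitlines():
        if not line.strip():
            capturing = False  # a blank line ends any section
            continue
        depth = len(line) - len(line.lstrip(" "))
        if depth <= indent:
            # A new section header (or a GPU header) ends the previous capture.
            label = line.split(":", 1)[0].strip()
            capturing = depth == indent and label in wanted
        if capturing:
            kept.append(line)
    return kept


# Only query the driver when nvidia-smi is actually installed.
if __name__ == "__main__" and shutil.which("nvidia-smi"):
    result = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True)
    print("\n".join(extract_sections(result.stdout)))
```

Running it right after the application fails should print only the ECC-related blocks, ready to paste into a reply. nvidia-smi -q -d ECC,PAGE_RETIREMENT should give a similar view without any scripting.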

Uncorrectable ECC errors must be cleared manually by explicit user action using nvidia-smi. CUDA will refuse to establish a context on the GPU until that happens.
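
A minimal sketch of what that manual step might look like, assuming the counter reset option of nvidia-smi (-p, also spelled --reset-ecc-errors) is what is meant; the post above does not name the exact option. The helper name reset_ecc_command is hypothetical. It only builds and prints the command, because the reset normally needs root and should be a deliberate action.

```python
def reset_ecc_command(gpu_index, aggregate=False):
    """Build the nvidia-smi argv that resets ECC error counters on one GPU.

    "-p 0" targets the volatile counters, "-p 1" the aggregate counters.
    """
    counter = "1" if aggregate else "0"
    return ["nvidia-smi", "-i", str(gpu_index), "-p", counter]


if __name__ == "__main__":
    # Dry run: show the command for GPU 0 instead of executing it.
    print("would run:", " ".join(reset_ecc_command(0)))
```

If the counters on an ECC-enabled board show real uncorrectable errors, a full GPU reset (nvidia-smi -r) or a reboot may also be needed. Treat that as an assumption to check against the nvidia-smi documentation for the installed driver.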

I ran nvidia-smi -q, and the results are as follows.

It doesn’t look to me like ECC is enabled on that P100.

I don’t have any suggestions about why that error is occurring. Without a better understanding of the test case, I don’t think I could explain it.

One thing to check is that the P100 is not overheating.
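
One possible way to watch for overheating while the application runs, sketched under a few assumptions: the function name parse_temperatures, the sample count and the polling interval are all made up for illustration, and the query uses the temperature.gpu field of nvidia-smi --query-gpu.

```python
import csv
import shutil
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]
SAMPLES = 10      # arbitrary number of readings
INTERVAL_S = 1.0  # arbitrary delay between readings, in seconds


def parse_temperatures(csv_text):
    """Turn nvidia-smi CSV rows into (index, name, temperature_celsius) tuples.

    Rows whose temperature is not a plain number (for example "N/A") are skipped.
    """
    rows = []
    for row in csv.reader(csv_text.splitlines(), skipinitialspace=True):
        if len(row) != 3:
            continue
        index, name, temp = (field.strip() for field in row)
        if index.isdigit() and temp.isdigit():
            rows.append((int(index), name, int(temp)))
    return rows


# Only poll when nvidia-smi is actually installed.
if __name__ == "__main__" and shutil.which("nvidia-smi"):
    for _ in range(SAMPLES):
        out = subprocess.run(QUERY, capture_output=True, text=True).stdout
        for index, name, temp in parse_temperatures(out):
            print(f"GPU {index} ({name}): {temp} C")
        time.sleep(INTERVAL_S)
```

Start it in a second terminal just before launching the CUDA application, and compare the readings around the moment the error appears. A passively cooled P100 without enough chassis airflow would be the case to look out for.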

I have never seen CUDA report an ECC error for a GPU that has ECC disabled, as the nvidia-smi output shown here indicates. I guess it is possible that the CUDA runtime could get corrupted by out-of-bounds writes from a CUDA application linked to it, leading to bogus CUDA status returns. In that case, valgrind may help find such writes.