I’ve already installed and verified GDS successfully, but when running my CUDA code it now always fails with cudaErrorECCUncorrectable (214), an error that never showed up before. I wonder if it’s a mismatch between the GPU and the NVIDIA driver or the CUDA driver, because when running the same code on a 3080 Ti with CUDA 11.7 the error never appeared. If that’s not the reason, how can I solve it?
(CUDA: 11.7, NVIDIA driver: 515.43.04, NIC: Mellanox ConnectX-5, MLNX_OFED: MLNX_OFED_LINUX-5.3-1.0.0.1-ubuntu18.04-x86_64, kernel (uname -r): 5.4.0-70-generic, SSD: Optane NVMe SSD, GPU: Tesla P100)
P.S. 1. The ECC error doesn’t disappear after a cold boot; the code runs for a little while, but the same error soon recurs. 2. The nvidia-smi output looks fine (as follows).
nvidia-bug-report.log.gz (653.5 KB)
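The error surfaces through ordinary status checks in my code; here is a simplified sketch of the checking pattern (illustrative only, not my actual code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal status check: once the driver has flagged an uncorrectable ECC
// error, essentially any runtime API call or kernel launch on that GPU
// returns cudaErrorECCUncorrectable (214).
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "%s:%d: %s (%d)\n", __FILE__, __LINE__, \
                         cudaGetErrorString(err_), (int)err_);           \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

int main() {
    void* buf = nullptr;
    CUDA_CHECK(cudaMalloc(&buf, 1 << 20)); // the kind of call that fails with 214
    CUDA_CHECK(cudaFree(buf));
    return 0;
}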
The RTX 3080 Ti is a consumer GPU that does not support ECC, so obviously no ECC errors will be reported there.
The nvidia-smi overview you included cannot tell the whole story. After your app fails with an ECC error, run nvidia-smi -q on the Tesla P100 and then copy the following section of its output here:
Ecc Mode
    Current :
    Pending :
ECC Errors
    Volatile
        SRAM Correctable :
        SRAM Uncorrectable :
        DRAM Correctable :
        DRAM Uncorrectable :
    Aggregate
        SRAM Correctable :
        SRAM Uncorrectable :
        DRAM Correctable :
        DRAM Uncorrectable :
Retired Pages
    Single Bit ECC :
    Double Bit ECC :
    Pending Page Blacklist :
Remapped Rows :
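(To capture just that section, nvidia-smi -q -d ECC limits the query output to the ECC fields; add -i <index> if more than one GPU is installed.)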
Uncorrectable ECC errors must be cleared manually by explicit user action using nvidia-smi. CUDA will refuse to establish a context on the GPU until that happens.
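For reference, nvidia-smi -p 0 resets the volatile ECC error counts and nvidia-smi -p 1 the aggregate counts; both require root privileges.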
I ran nvidia-smi -q and the results are as follows.
It doesn’t look to me like ECC is enabled on that P100.
I don’t have any suggestions about why that error is occurring; without a better understanding of the test case, I can’t explain it.
One thing to check is to make sure the P100 is not overheating.
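nvidia-smi -q -d TEMPERATURE reports the current GPU temperature together with the slowdown and shutdown thresholds, and nvidia-smi dmon prints a periodic one-line summary you can watch while the test runs.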
I have never seen CUDA report an ECC error for a GPU that has ECC disabled per the nvidia-smi output shown here. I guess it is possible that the CUDA runtime could get corrupted by out-of-bounds writes from a CUDA application linked to it, leading to bogus CUDA status returns. In that case valgrind may help find such writes.
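For example, valgrind --tool=memcheck ./your_app, with your_app standing in for your test binary. The CUDA driver itself typically triggers a number of benign valgrind warnings, so focus on invalid writes attributed to your own code.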