"uncorrectable ECC error encountered"

TheMatt · February 4, 2014, 7:24pm

All,

I’m testing PGI 14.1 CUDA Fortran here and have found a few issues, all of which I managed to work around. Huzzah! Today, though, I was running some Hyper-Q tests, scaling from 4 to 8 to 24 processors on a couple of K20x cards.

The 4- and 8-processor runs went just fine, but when I tried the 24-processor run (12 CPU cores per GPU), an odd error appeared. In my code I do this:

     STATUS = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
     if (STATUS /= 0) then
        write (*,*) "cudaDeviceSetCacheConfig failed: ", cudaGetErrorString(STATUS)
        ASSERT_(.FALSE.)
     end if

Whether this call is still needed is debatable, but, well, it hasn’t hurt. This time, though, it spat out:

 cudaDeviceSetCacheConfig failed: 
 uncorrectable ECC error encountered

It didn’t do this with the 4- or 8-processor jobs, which ran the exact same code. And I’ve run 24 processors before with PGI 13.10, so it’s not like it can’t be done.

Has anyone out there ever seen this? I looked at “nvidia-smi -a” to see if the cards were going nuts, yet both of them report:

    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

So before I try to engage PGI’s parent (NVIDIA) about this, I thought I’d ask here whether anyone has encountered it before.

Thanks,
Matt

MatColgrove · February 4, 2014, 9:01pm

Hi Matt,

I’ve never encountered this myself, but there was one other report on the PGI User Forum a little over two years ago. I asked the NVIDIA folks at the time, and they thought it was most likely a bad board. In your case, though, I’m not sure that is the cause.

Are you using the same system that you used when you built with 13.10?

If so, can you try building with 13.10 again? (Maybe the boards have failed since you last used them?) If not, can you try running the 14.1 build on the other system?

If nothing changed but the compiler version, then I’ll ask around and see what else could cause this.
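
In the meantime, one check that might help narrow this down: the CUDA runtime creates its context lazily, and runtime calls can return errors left over from earlier asynchronous work, so the cache config call may only be where the error is reported, not where it originates. Below is a minimal sketch, not taken from your code, that reports the device and any pending error before making the same call. The program name `check_cache_config` and the `report` helper are hypothetical, and it assumes the `cudafor` module that ships with PGI CUDA Fortran.

```fortran
program check_cache_config
   use cudafor
   implicit none
   integer :: istat, dev

   ! Report which device this process is using (helpful when many
   ! processes share one GPU under Hyper-Q).
   istat = cudaGetDevice(dev)
   call report("cudaGetDevice", istat)
   write (*,*) "Using device ", dev

   ! Wait for any outstanding work so that an earlier asynchronous
   ! failure surfaces here instead of at the next API call.
   istat = cudaDeviceSynchronize()
   call report("cudaDeviceSynchronize", istat)

   ! The call from the original post, checked against cudaSuccess.
   istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
   call report("cudaDeviceSetCacheConfig", istat)

contains

   ! Hypothetical helper: print the CUDA error string and stop on failure.
   subroutine report(name, code)
      character(len=*), intent(in) :: name
      integer, intent(in) :: code
      if (code /= cudaSuccess) then
         write (*,*) trim(name), " failed: ", trim(cudaGetErrorString(code))
         stop 1
      end if
   end subroutine report

end program check_cache_config
```

Saved as a `.cuf` file, this should build with something like `pgfortran -Mcuda`. If the ECC message already shows up at `cudaGetDevice` or `cudaDeviceSynchronize`, the cache config call is probably not the real culprit.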

Mat
