"uncorrectable ECC error encountered"

All,

I’ve been testing PGI 14.1 CUDA Fortran here and have found a few issues, but managed to work around them…huzzah! Today, however, I was running some Hyper-Q tests, going from 4 to 8 to 24 processors on a couple of K20x cards.

The 4- and 8-processor runs went just fine, but when I tried the 24-processor run (12 CPU cores per GPU), an odd error appeared. In my code I do this:

     STATUS = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
     if (STATUS /= 0) then
        write (*,*) "cudaDeviceSetCacheConfig failed: ", cudaGetErrorString(STATUS)
        ASSERT_(.FALSE.)
     end if

Whether this call is still needed is debatable, but, well, it hasn’t hurt. This time, though, it spit out:

 cudaDeviceSetCacheConfig failed: 
 uncorrectable ECC error encountered

It didn’t do this with the 4- or 8-core jobs, which ran the exact same code. And I’ve run on 24 processors before with PGI 13.10, so it’s not like it can’t be done.

Has anyone out there ever seen this? I looked at “nvidia-smi -a” to see if the cards were going nuts, yet both of them report:

    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

So before I try and engage PGI’s parent about this, I thought I’d ask here if anyone had ever encountered it before.

Thanks,
Matt

Hi Matt,

I’ve never encountered this myself, but there was one other report on the PGI User Forum a little over two years ago. I asked the NVIDIA folks at the time, and they thought it was most likely a bad board, though I’m not sure that is true in your case.

Are you using the same system that you used when you built with 13.10?

If so, can you try building with 13.10 again? (Maybe the boards have failed since you last used them?) If not, can you try running the 14.1 build on the other system?

If nothing changed but the compiler version, then I’ll ask around and see what else could cause this.
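
In the meantime, here is a minimal sketch of a check that might help narrow things down. It rests on an assumption, not a diagnosis: CUDA runtime calls can report errors left over from earlier asynchronous work, so the ECC message may not originate in cudaDeviceSetCacheConfig itself. The program name and the messages are made up for illustration. Only the cache-config call comes from your code.

```fortran
! Minimal sketch in CUDA Fortran (build with something like: pgfortran -Mcuda).
! Assumption: each process runs this before its first real GPU work.
program check_cache_config
   use cudafor
   implicit none
   integer :: istat, dev
   type(cudaDeviceProp) :: prop

   ! Report which device this process is attached to, so a failure
   ! can be tied to one specific board.
   istat = cudaGetDevice(dev)
   if (istat /= cudaSuccess) then
      write (*,*) "cudaGetDevice failed: ", cudaGetErrorString(istat)
      stop 1
   end if

   istat = cudaGetDeviceProperties(prop, dev)
   if (istat /= cudaSuccess) then
      write (*,*) "cudaGetDeviceProperties failed: ", cudaGetErrorString(istat)
      stop 1
   end if
   write (*,*) "Using device ", dev, ": ", trim(prop%name)

   ! Wait for outstanding work so that an error from an earlier
   ! operation shows up here rather than in the next call.
   istat = cudaDeviceSynchronize()
   if (istat /= cudaSuccess) then
      write (*,*) "Error pending before cache config: ", cudaGetErrorString(istat)
      stop 1
   end if

   ! The call from the original post.
   istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
   if (istat /= cudaSuccess) then
      write (*,*) "cudaDeviceSetCacheConfig failed: ", cudaGetErrorString(istat)
      stop 1
   end if

   write (*,*) "cudaDeviceSetCacheConfig succeeded"
end program check_cache_config
```

If the failure already shows up at the cudaDeviceSynchronize step, or only ever on one device number, that would point at the board or at earlier work rather than at the cache-config call.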

- Mat