All,
I’m doing some testing with PGI 14.1 CUDA Fortran here and I’ve found a few issues, but managed to work around them…huzzah! However, today I was doing some Hyper-Q tests, going from 4 to 8 to 24 processors on a couple of K20x cards.
Now the 4 and 8 processor runs went just fine, but when I tried the 24 processor run (12 CPU cores per GPU), an odd error appeared. In my code I do this:
    STATUS = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
    if (STATUS /= 0) then
       write (*,*) "cudaDeviceSetCacheConfig failed: ", cudaGetErrorString(STATUS)
       ASSERT_(.FALSE.)
    end if
Now whether this is needed or not anymore is debatable, but, well, it hasn’t hurt. But this time it spit out:
cudaDeviceSetCacheConfig failed:
uncorrectable ECC error encountered
It didn’t do this with the 4 or 8 core job, which ran the exact same code. And I’ve done 24 processors before with PGI 13.10, so it’s not like it can’t do it.
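For anyone trying to reproduce this, a more verbose form of that check might help narrow things down. CUDA runtime calls can return an error code left over from earlier work on the device, so the failure may not originate in cudaDeviceSetCacheConfig itself. The sketch below is only an assumption about how one could separate the two cases, not what the code above currently does: the program name and the variables istat and dev are hypothetical, and `stop 1` stands in for the ASSERT_ macro.

```fortran
program check_cache_config
   use cudafor
   implicit none
   integer :: istat, dev

   ! Ask which device this process is bound to, so the message can name it.
   dev = -1
   istat = cudaGetDevice(dev)
   if (istat /= cudaSuccess) then
      write (*,*) "cudaGetDevice failed: ", trim(cudaGetErrorString(istat))
      stop 1
   end if

   ! Report (and clear) any error already pending before the call of interest.
   istat = cudaGetLastError()
   if (istat /= cudaSuccess) then
      write (*,*) "error pending on device ", dev, ": ", &
                  trim(cudaGetErrorString(istat))
   end if

   ! The original call, now reported together with the device number.
   istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
   if (istat /= cudaSuccess) then
      write (*,*) "cudaDeviceSetCacheConfig failed on device ", dev, ": ", &
                  trim(cudaGetErrorString(istat))
      stop 1   ! stand-in for ASSERT_(.FALSE.)
   end if

   write (*,*) "cache config set on device ", dev
end program check_cache_config
```

If the error already shows up at cudaGetDevice or cudaGetLastError, the cache-config call is probably just the first place it gets noticed.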
Has anyone out there ever seen this? I tried looking at “nvidia-smi -a” to see if the cards were going nuts and yet they both look like:
ECC Errors
    Volatile
        Single Bit
            Device Memory : 0
            Register File : 0
            L1 Cache : 0
            L2 Cache : 0
            Texture Memory : 0
            Total : 0
        Double Bit
            Device Memory : 0
            Register File : 0
            L1 Cache : 0
            L2 Cache : 0
            Texture Memory : 0
            Total : 0
    Aggregate
        Single Bit
            Device Memory : 0
            Register File : 0
            L1 Cache : 0
            L2 Cache : 0
            Texture Memory : 0
            Total : 0
        Double Bit
            Device Memory : 0
            Register File : 0
            L1 Cache : 0
            L2 Cache : 0
            Texture Memory : 0
            Total : 0
Retired Pages
    Single Bit ECC : 0
    Double Bit ECC : 0
    Pending : No
So before I try and engage PGI’s parent about this, I thought I’d ask here if anyone had ever encountered it before.
Thanks,
Matt