Question about ECC memory resiliency


I am running CUDA on a Tesla V100, whose HBM2 memory supports Single-Error Correcting, Double-Error Detecting (SECDED) ECC.

My questions are:

Does ECC have to be enabled explicitly in order to detect errors, or is it enabled by default?
Additionally, how will the system inform me about an error? Will it stop execution and report the error, or will it correct the error automatically and continue?

Thanks in advance!

On the V100, ECC should be enabled by default. You can verify this with the nvidia-smi tool.
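If you would rather check from inside the application, the CUDA runtime exposes the ECC state as a device attribute. The following is only a minimal sketch: the helper names `queryEccEnabled` and `printEccState` are made up for illustration, and it assumes you are interested in a single device index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: asks the runtime whether ECC is currently enabled
// on `device`. On success, *enabled is set to 1 (on) or 0 (off).
cudaError_t queryEccEnabled(int device, int *enabled) {
    return cudaDeviceGetAttribute(enabled, cudaDevAttrEccEnabled, device);
}

// Hypothetical helper: prints the ECC state of one device.
void printEccState(int device) {
    int enabled = 0;
    cudaError_t status = queryEccEnabled(device, &enabled);
    if (status != cudaSuccess) {
        std::fprintf(stderr, "ECC query failed: %s\n",
                     cudaGetErrorString(status));
        return;
    }
    std::printf("Device %d: ECC %s\n", device,
                enabled ? "enabled" : "disabled");
}
```

Calling `printEccState(0)` early in `main` should be enough for a single-GPU setup. On the command line, `nvidia-smi -q -d ECC` reports the same mode along with the error counters.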

Yes, ECC has to be enabled for it to do anything.

A single-bit (i.e. correctable) error is corrected on the fly, and you won’t receive any notification. However, such errors may be visible in the nvidia-smi tool.

Thank you!

Is there any case in which ECC memory will only inform me that an error was detected?

Is this the case for multi-bit errors? Or is ECC memory only responsible for correcting single-bit errors?

From memory:

Single-bit errors are corrected silently, but their occurrence is counted and reported via nvidia-smi. Continuation is fine since user-visible state has not been corrupted, i.e. the integrity of the data is preserved.
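The counters that nvidia-smi shows can also be read programmatically through NVML, the library nvidia-smi is built on. Here is a rough sketch, assuming NVML headers are installed and the program is linked with `-lnvidia-ml`; `readEccCount` and `printEccCounts` are hypothetical helper names, not part of NVML.

```cuda
#include <cstdio>
#include <nvml.h>

// Hypothetical helper: reads the total ECC error count of one type for the
// GPU at `index`. Assumes nvmlInit_v2() has already succeeded.
nvmlReturn_t readEccCount(unsigned int index,
                          nvmlMemoryErrorType_t type,
                          nvmlEccCounterType_t counter,
                          unsigned long long *count) {
    nvmlDevice_t dev;
    nvmlReturn_t rc = nvmlDeviceGetHandleByIndex(index, &dev);
    if (rc != NVML_SUCCESS) return rc;
    return nvmlDeviceGetTotalEccErrors(dev, type, counter, count);
}

// Hypothetical helper: prints corrected and uncorrected volatile counts.
void printEccCounts(unsigned int index) {
    nvmlReturn_t rc = nvmlInit_v2();
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "NVML init failed: %s\n", nvmlErrorString(rc));
        return;
    }
    unsigned long long corrected = 0, uncorrected = 0;
    // Volatile counters cover errors since the driver was last loaded.
    rc = readEccCount(index, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                      NVML_VOLATILE_ECC, &corrected);
    if (rc == NVML_SUCCESS)
        rc = readEccCount(index, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                          NVML_VOLATILE_ECC, &uncorrected);
    if (rc == NVML_SUCCESS)
        std::printf("GPU %u: corrected=%llu uncorrected=%llu\n",
                    index, corrected, uncorrected);
    else
        std::fprintf(stderr, "ECC count query failed: %s\n",
                     nvmlErrorString(rc));
    nvmlShutdown();
}
```

Passing `NVML_AGGREGATE_ECC` instead of `NVML_VOLATILE_ECC` should return the lifetime counts rather than the counts since the last driver load.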

Double-bit errors cannot be corrected, only detected. Detection causes a CUDA status of cudaErrorECCUncorrectable to be returned. Subsequent CUDA kernel launches will fail until the error is explicitly cleared via nvidia-smi. This prevents a CUDA application from continuing on the GPU with known bad data.
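In application code, this means checking the status returned by CUDA calls and treating cudaErrorECCUncorrectable as fatal for the current run. The sketch below illustrates one way to do that; the kernel `touch`, the enum `GpuHealth`, and the helpers `classify` and `runOnce` are invented for the example, and it assumes the error is reported no later than the next synchronizing call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical classification of a CUDA status code.
enum class GpuHealth { Ok, EccUncorrectable, OtherError };

GpuHealth classify(cudaError_t status) {
    if (status == cudaSuccess) return GpuHealth::Ok;
    if (status == cudaErrorECCUncorrectable) return GpuHealth::EccUncorrectable;
    return GpuHealth::OtherError;
}

// Placeholder kernel: increments every element of the array.
__global__ void touch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Launches the kernel once and reports how the run ended.
GpuHealth runOnce(float *d_data, int n) {
    touch<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaError_t status = cudaGetLastError();      // launch-time errors
    if (status == cudaSuccess)
        status = cudaDeviceSynchronize();         // errors during execution
    GpuHealth health = classify(status);
    if (health == GpuHealth::EccUncorrectable)
        std::fprintf(stderr, "Uncorrectable ECC error: results are not "
                             "trustworthy, stopping.\n");
    else if (health == GpuHealth::OtherError)
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
    return health;
}
```

A caller would typically stop submitting work and exit as soon as `runOnce` returns anything other than `GpuHealth::Ok`, rather than trying to continue with data that is known to be bad.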

Yes, a double-bit (i.e. uncorrectable, but detectable) error will result in a corrupted CUDA context, which means that all CUDA API calls from that point forward will return an error, until you terminate the application/process.