Handling Double Bit Exceptions in TensorFlow

We’re using Keras and TensorFlow for a deep learning application on Google Cloud Platform machines with K80 GPUs.

We’ve been having some problems with Double Bit ECC (DBE) errors. According to the official documentation https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html:

Applications will receive a DBE event notification for graceful exit, and no further context will be created on the GPU until the DBE is mapped out.

When these errors occur, our application starts using 100% CPU. We don’t know what it is doing at that point, but we’ll work on adding more ways of monitoring it.

My question is: how does my application receive these DBE event notifications? Is it a SIGTERM, some type of error I should be catching when I call Keras, or something else I should be doing?

Thanks in advance

The use of the plural “errors” is a bit disconcerting. The expected error rate for double-bit errors in continuous operation is around one per year per GPU (a K80 comprises two GPUs). I would complain to Google if you are seeing significantly higher DBE rates than that. There may be a few bad GPUs in their systems that they need to replace (all electronics are subject to aging), or cooling may not be adequate (data from supercomputers shows that GPU memory errors have positive correlation with operating temperatures).

It has been a long time since I saw a DBE, but if I recall correctly, the effect is similar to a watchdog timeout failure: the running kernel is stopped, the current CUDA context is destroyed, and control is returned to the CUDA application with an appropriate CUDA error status. The difference from a timeout failure (other than a different CUDA status code) is that with a DBE event the creation of new CUDA contexts on the affected GPU is disabled until the error is cleared. I seem to recall nvidia-smi can be used to clear such errors.
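As a rough sketch of how one might check the state a DBE leaves behind, the snippet below asks nvidia-smi for the uncorrected ECC count and the page-retirement state, then parses the CSV reply. The query field names are the ones I believe `nvidia-smi --help-query-gpu` lists, and the "Yes"/"No" value for pending retirement is likewise an assumption; please verify both on your driver version. The helper names are hypothetical.

```python
import csv
import io
import subprocess

# Field names assumed from `nvidia-smi --help-query-gpu`; verify on your driver.
FIELDS = [
    "index",
    "ecc.errors.uncorrected.volatile.total",
    "retired_pages.double_bit.count",
    "retired_pages.pending",
]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_ecc_state():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_ecc_csv(out)


def gpus_needing_attention(rows):
    """Return indices of GPUs with uncorrected errors or pending retirement."""
    flagged = []
    for row in rows:
        uncorrected = row["ecc.errors.uncorrected.volatile.total"]
        pending = row["retired_pages.pending"]
        has_errors = uncorrected.isdigit() and int(uncorrected) > 0
        if has_errors or pending.lower() == "yes":
            flagged.append(row["index"])
    return flagged
```

On a machine with the driver installed, `gpus_needing_attention(query_ecc_state())` would list the GPU indices worth a closer look. The parser is kept separate from the subprocess call so it can be exercised on captured output.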

My memory may be faulty; it would be best to wait for confirmation/rebuttal from someone who has dealt with DBEs more recently.

Njuffa, if your memory were faulty you’d surely have seen DBEs more recently.

/me runs away…

:-)

Seriously though, I don’t think biological memory like my aging brain works on a digital basis, using bits. Or does it? Any insights appreciated.

The context is corrupted. Any further attempt to use the context will result in an API error being returned. That is the primary application-level notification. Keras is basically a layer of third-party software on top of the application I am referring to here (itself a third-party framework). However, conceptually, the reporting of this “in Keras” would be no different than, for example, if Keras attempted to allocate more GPU memory than was available. Any attempt to use the GPU at this point by Keras (i.e. by whatever framework Keras is sitting on top of, e.g. TF) would be met with a runtime-reported error. It is then a question for the Keras community what Keras does with those errors.
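To make the "runtime-reported error" idea concrete, here is a minimal sketch of a guard around a training or inference call. It is not what Keras or TF do internally. The function name `run_guarded`, the default exception type, and the exit code are all hypothetical; with TensorFlow you would probably pass something like `tf.errors.OpError` as the exception type, but check which exception your version actually raises.

```python
import logging
import sys

log = logging.getLogger("gpu_guard")


def run_guarded(step_fn, gpu_error_types=(RuntimeError,), exit_code=75):
    """Call step_fn(); on a GPU-side failure, log it and exit.

    gpu_error_types is an assumption: RuntimeError is only a stand-in for
    whatever exception the framework raises when the CUDA context is dead.
    """
    try:
        return step_fn()
    except gpu_error_types as exc:
        # After a DBE the context is unusable, so an in-process retry on the
        # same GPU cannot succeed; exit and let a supervisor restart the job.
        log.error("GPU call failed, shutting down: %s", exc)
        sys.exit(exit_code)
```

A call such as `run_guarded(lambda: model.fit(x, y))` would then end the process with a distinct exit code instead of leaving it spinning, so an external supervisor can reschedule the work on a healthy GPU.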

The information is also reported in the system logs (e.g. dmesg):

https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#xids
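For monitoring from outside the application, one option is to scan the kernel log for the relevant Xid. The sketch below assumes Xid 48 is the double-bit ECC code (as I understand the linked page) and that log lines look roughly like `NVRM: Xid (PCI:0000:04:00): 48, ...`; the exact message text varies by driver, so the pattern only matches the prefix. The function name is hypothetical.

```python
import re

# Xid code assumed to mean "double-bit ECC error"; confirm against the docs.
DBE_XID = 48

# Assumed log prefix: "NVRM: Xid (PCI:0000:04:00): 48, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<xid>\d+)")


def find_dbe_events(log_lines):
    """Return (device, line) pairs for every Xid 48 entry in the given lines."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match and int(match.group("xid")) == DBE_XID:
            events.append((match.group("device"), line.strip()))
    return events
```

The input could come from something like `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`, which may need elevated privileges on some systems.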

DCGM can be configured to monitor and take action on DBEs:

https://devblogs.nvidia.com/nvidia-data-center-gpu-manager-cluster-administration/

I believe your cross-posting:

https://stackoverflow.com/questions/53609779/handling-double-bit-exceptions-gpu-errors-in-tensorflow

now has a reasonably correct answer also.

Thanks for all your responses, I have enough information to understand how I should be handling it :)