We’re using Keras and TensorFlow for a deep learning application on Google Cloud Platform machines with K80 GPUs.
We’ve been having some problems with double-bit ECC (DBE) errors. According to the official documentation (https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html):
"Applications will receive a DBE event notification for graceful exit, and no further context will be created on the GPU until the DBE is mapped out."
When these errors occur, our application starts using 100% CPU. We don’t yet know what it is doing at that point, but we’ll work on adding more ways of monitoring it.
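For the monitoring side, one option we are considering is a stack dump on demand. This is only a minimal sketch, assuming the training process is a Python script running on Linux; the function name `install_stack_dump_handler` is made up for illustration. It uses the standard-library `faulthandler` module and does not handle the DBE notification itself.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(stream=sys.stderr, signum=signal.SIGUSR1):
    """Dump the Python traceback of every thread to `stream` when `signum` arrives.

    `stream` must stay open and be backed by a real file descriptor,
    because faulthandler writes to it directly from the signal handler.
    """
    faulthandler.register(signum, file=stream, all_threads=True)
    return stream
```

We would call this once at startup, before training begins, and then run `kill -USR1 <pid>` against the process when it is stuck at 100% CPU. The dump only shows Python frames, so if the process is spinning inside native TensorFlow or CUDA code it will show the Python line that made the call rather than the native stack.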
My question is: how does my application receive these DBE event notifications? Is it a SIGTERM, some type of error I should be catching when calling Keras, or something else I should be doing?
Thanks in advance