Tool to find out the cause of CUDA error

Hi,
I am trying to run the TensorFlow GPU Docker container (tensorflow:1.15.0-gpu-py3) on my server (GPU: Tesla K40c, Ubuntu 16.04, CUDA version 11.0, driver version 450.51.06), but every time I get the error ‘Error polling for event status: failed to query event: CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered’. I tried to replicate the error on another server with the same setup (i.e. the same GPU, Ubuntu version, etc.), but the code ran smoothly there, so I suspect that the problem might be hardware-related. Are there any tools I could use to run GPU diagnostics to see whether that is indeed the case?
Thanks a lot in advance!

You may find something useful here:


As the error message indicates, the GPU experienced an uncorrectable ECC error. GPUs with ECC implement SECDED (single error correction, double error detection). This means there was an uncorrectable double-bit error. This kind of error is “sticky”, meaning explicit action is required to clear it.

Once such an error occurs, CUDA refuses to issue further work or establish a new context on the device until the error is cleared explicitly. If you look at the status of the GPU with nvidia-smi -q, you should see at least one double-bit error reported in the ECC Errors section.
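For reference, here is one way to look at just the ECC counters (the -d ECC display filter is standard nvidia-smi; the excerpt below is only illustrative and the exact section layout varies between driver versions):

    # Show only the ECC sections of the per-GPU status report
    nvidia-smi -q -d ECC

    # Illustrative excerpt (numbers made up); a non-zero "Double Bit"
    # count under "Volatile" is the uncorrectable error discussed here:
    #   ECC Errors
    #       Volatile
    #           Double Bit
    #               Device Memory   : 1
    #               ...
    #       Aggregate
    #           ...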

To clear the error, use nvidia-smi --reset-ecc-errors=0, then reboot the system. You may need administrator or superuser privileges to issue this command. The argument 0 indicates that only the volatile error count should be cleared, while retaining the aggregate count (which one might want to track over time). Uncorrectable ECC errors should be very rare events on any particular GPU. If you continue to experience them multiple times on the same device, it may be an indication that this GPU, which physically ages like all electronic devices, is nearing the end of its useful lifetime.
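A minimal sketch of that sequence, assuming superuser access via sudo (use -i <index> to restrict the reset to one GPU on a multi-GPU machine):

    # Clear the volatile (since last reboot) ECC error counts only
    sudo nvidia-smi --reset-ecc-errors=0

    # Optionally target a single GPU, e.g. GPU 0
    sudo nvidia-smi -i 0 --reset-ecc-errors=0

    # Then reboot the system
    sudo reboot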


@rs277 thanks a lot, I will look into that!

@njuffa thank you ever so much! I ran nvidia-smi --reset-ecc-errors=0, which indeed required superuser privileges, and also rebooted the system, but the error popped up again. I wonder whether there is anything else I could try, e.g. using another NVIDIA driver, or CUDA version, or Docker container, or whether I just have to accept that this particular GPU is no good for deep learning any longer.

If clearing the ECC error with reset-ecc-errors is successful at first (check the counter with nvidia-smi immediately after the system has rebooted), but an uncorrectable ECC error re-occurs later after running CUDA-accelerated code, there is a good chance that the memory on the GPU has stopped working correctly. A K40 would be about 7 years old at this time, and memory is typically the first thing that goes bad in aging GPUs.
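One convenient way to check those counters right after the reboot is the CSV query interface (field names as listed by nvidia-smi --help-query-gpu; older drivers may not support all of them):

    # Report the uncorrected (double-bit) ECC counts, volatile and aggregate
    nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.total --format=csv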

Before you toss out the card, I would suggest repeating the attempt, but this time clearing both the volatile and the aggregate ECC error counts. If the ECC error still occurs after that, I would toss out this GPU if it were my hardware. If that is not a realistic option for you, and you believe that deep learning will work correctly with some occasional bad data in the mix, you could try turning off ECC with nvidia-smi.
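A sketch of both suggestions (superuser assumed; note that an ECC mode change only takes effect after the next reboot):

    # Clear both the volatile and the aggregate ECC error counts
    sudo nvidia-smi --reset-ecc-errors=0
    sudo nvidia-smi --reset-ecc-errors=1

    # If the uncorrectable errors keep coming back, disable ECC entirely
    # (no more error detection/correction, but a bit more usable memory)
    sudo nvidia-smi -e 0        # same as --ecc-config=0; re-enable with -e 1

    # Reboot for the ECC mode change to take effect
    sudo reboot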


@njuffa Thank you once again! Indeed, it seems that the GPU in question is of no use any longer, so I replaced it - with another K40 for the time being (hopefully), haha.