Tool to find out the cause of CUDA error

Hi,
I am trying to run the TensorFlow GPU Docker container (tensorflow:1.15.0-gpu-py3) on my server (GPU: Tesla K40c, Ubuntu 16.04, CUDA version 11.0, driver version 450.51.06), but every time I get the error ‘Error polling for event status: failed to query event: CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered’. I tried to replicate the error on another server with the same setup (i.e. the same GPU, Ubuntu version, etc.), but the code ran smoothly there, so I suspect that the problem might be hardware-related. Are there any tools I could use to run GPU diagnostics to see whether that is indeed the case?
Thanks a lot in advance!

You may find something useful here:


As the error message indicates, the GPU experienced an uncorrectable ECC error. GPUs with ECC implement SECDED (single error correction, double error detection). This means there was an uncorrectable double-bit error. This kind of error is “sticky”, meaning explicit action is required to clear it.

Once such an error occurs, CUDA refuses to issue further work or establish a new context on the device until the error is cleared explicitly. If you look at the status of the GPU with nvidia-smi -q, you should see at least one double-bit error reported in the ECC Errors section.
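For reference, here is one way to look at just the ECC counters (the -d ECC display filter is standard nvidia-smi; the excerpt below is only illustrative and the exact section layout varies between driver versions):

    # Show only the ECC sections of the per-GPU status report
    nvidia-smi -q -d ECC

    # Illustrative excerpt (numbers made up); a non-zero "Double Bit"
    # count under "Volatile" is the uncorrectable error discussed here:
    #   ECC Errors
    #       Volatile
    #           Double Bit
    #               Device Memory   : 1
    #               ...
    #       Aggregate
    #           ...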

To clear the error, use nvidia-smi --reset-ecc-errors=0, then reboot the system. You may need administrator or superuser privileges to issue this command. The argument 0 indicates that only the volatile error count should be cleared, while retaining the aggregate count (which one might want to track over time). Uncorrectable ECC errors should be very rare events on any particular GPU. If you continue to experience them multiple times on the same device, it may be an indication that this GPU, which physically ages like all electronic devices, is nearing the end of its useful lifetime.
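A minimal sketch of that sequence, assuming superuser access via sudo (use -i <index> to restrict the reset to one GPU on a multi-GPU machine):

    # Clear the volatile (since last reboot) ECC error counts only
    sudo nvidia-smi --reset-ecc-errors=0

    # Optionally target a single GPU, e.g. GPU 0
    sudo nvidia-smi -i 0 --reset-ecc-errors=0

    # Then reboot the system
    sudo reboot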


@rs277 thanks a lot, I will look into that!

@njuffa thank you ever so much! I ran nvidia-smi --reset-ecc-errors=0, which indeed required superuser privileges, and also rebooted the system, but the error popped up again. I wonder whether there is anything else I could try, e.g. using another NVIDIA driver, or CUDA version, or Docker container, or whether I just have to accept that this particular GPU is no good for deep learning any longer.

If clearing the ECC error with reset-ecc-errors is successful at first (check the counter with nvidia-smi immediately after the system has rebooted), but an uncorrectable ECC error re-occurs later after running CUDA-accelerated code, there is a good chance that the memory on the GPU has stopped working correctly. A K40 would be about 7 years old at this time, and memory is typically the first thing that goes bad in aging GPUs.
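One convenient way to check those counters right after the reboot is the CSV query interface (field names as listed by nvidia-smi --help-query-gpu; older drivers may not support all of them):

    # Report the uncorrected (double-bit) ECC counts, volatile and aggregate
    nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.total --format=csv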

Before you toss out the card, I would suggest repeating the attempt, but this time clearing both the volatile and the aggregate ECC error counts. If the ECC error still occurs after that, I would toss out this GPU if it were my hardware. If that is not a realistic option for you, and you believe that deep learning will work correctly with some occasional bad data in the mix, you could try turning off ECC with nvidia-smi.
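A sketch of both suggestions (superuser assumed; note that an ECC mode change only takes effect after the next reboot):

    # Clear both the volatile and the aggregate ECC error counts
    sudo nvidia-smi --reset-ecc-errors=0
    sudo nvidia-smi --reset-ecc-errors=1

    # If the uncorrectable errors keep coming back, disable ECC entirely
    # (no more error detection/correction, but a bit more usable memory)
    sudo nvidia-smi -e 0        # same as --ecc-config=0; re-enable with -e 1

    # Reboot for the ECC mode change to take effect
    sudo reboot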


@njuffa Thank you once again! Indeed, it seems that the GPU in question is of no use any longer, so I replaced it - with another K40 for the time being (hopefully), haha.