My application repeatedly calls a static library developed with CUDA, possibly thousands of times, because it is an iterative solver. Many things change on every call, so I cannot simply copy the data to the GPU once and leave it there.
I am sure the data size is not very large, but I run into this problem: "Cuda error in file ‘**.cu’ in line 112 : uncorrectable ECC error encountered."
Line 112 of the program is shown below:
This problem also occurs when I use
Has anyone run into this problem? Thank you!
And it happens all the time?
Looks like a hardware problem to me.
I am not familiar with uncorrectable ECC errors and asked some more knowledgeable colleagues for feedback.
From what I understand, once an uncorrectable ECC error occurs, the CUDA driver prevents all further activity in that context. This makes sense, since it is less risky to fail hard than to continue with known incorrect data. To get the GPU back into a working state, you will need to reboot the machine (I would suggest power cycling it for good measure).
You can check ECC error statistics, both correctable and uncorrectable errors, with the nvidia-smi utility. I believe you can also reset these statistics if desired using the same tool; you may need root privileges to do so.
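If it helps to watch the counters between solver runs, here is a minimal sketch of a script that reads them through nvidia-smi. It assumes the installed driver supports the `--query-gpu` fields named below (they appear in `nvidia-smi --help-query-gpu` on the systems I know of) and that unsupported fields come back as a non-numeric placeholder such as `[N/A]`. The function names are my own and not part of any NVIDIA tool.

```python
import subprocess

# Assumed query fields; check `nvidia-smi --help-query-gpu` on your system.
QUERY_FIELDS = (
    "index,"
    "ecc.errors.corrected.volatile.total,"
    "ecc.errors.uncorrected.volatile.total"
)


def _to_count(field):
    # Boards without ECC (or with ECC disabled) report a non-numeric
    # placeholder instead of a counter; map that to None.
    return int(field) if field.isdigit() else None


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, corrected, uncorrected = [f.strip() for f in line.split(",")]
        rows.append({
            "gpu": int(index),
            "corrected": _to_count(corrected),
            "uncorrected": _to_count(uncorrected),
        })
    return rows


def read_ecc_counts():
    """Run nvidia-smi and return the volatile ECC counters for every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ecc_csv(out)
```

Calling `read_ecc_counts()` on the affected machine before and after a run should show whether the uncorrected count is growing. A nonzero uncorrected count that returns after a power cycle would support the hardware explanation.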
If the problem with uncorrectable ECC errors persists, please submit a bug report at: http://www.nvidia.com/object/support_tpp.html
Please file it against the Tesla Computing card, and provide the product name plus a description of the problem.