Hello!
I am having trouble: my CUDA program sometimes raises error 700 ("an illegal memory access was encountered"), and it only happens for 8 out of about 70 of my customers.
The program runs on Dell 7070 PCs with a Quadro P620 graphics card and Windows 10 Home; the driver version is 460.89. It was developed with CUDA 10.2.
The error does not happen every time or on every PC. The program has been deployed to about 70 customers for about 2 months, and only 8 of them have reported this problem (customers must report the error when it happens). For those 8 customers, the error happens about once or twice a week. Whenever it happens, re-running the program with the same data reproduces the error continuously until I reboot the computer, or until the screen goes black and the GPU recovers. Sometimes when the error happens there is an entry in the Windows Event Viewer like this: Level: Error, Source: nvlddmkm, Event ID: 13. The event detail can be:
\Device\Video3
Graphics SM Warp Exception on(GPC 1, TPC 1): Out of Range Address
Or
\Device\Video3
Graphics Exception: SER 0x504648=0x137000e 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f
The exception can be on (GPC 0, TPC 0), (GPC 0, TPC 1), (GPC 1, TPC 0) or (GPC 1, TPC 1).
I have tried testing the program with cuda-memcheck; it reports no errors.
In my code, all device memory allocations are done once at startup, but the error happens in the middle of a run. I have also checked the temporary variables in the kernel code, and they do not appear to exceed the device heap size limit. I can share parts of my code if necessary.
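Since error 700 is reported asynchronously, it can surface at a later API call than the kernel that actually faulted. A minimal sketch of the per-launch checking I could enable in field builds to pinpoint the offending kernel (`myKernel`, `grid`, `block`, and the arguments are placeholders, not my real code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information whenever a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",                  \
                    (int)err_, cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// After each kernel launch:
myKernel<<<grid, block>>>(/* args */);
CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize()); // forces errors from *this* kernel to
                                     // surface here instead of at a later call
```

The `cudaDeviceSynchronize()` after every launch is only for debugging; it serializes the stream, so I would compile it out of release builds.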
The program is developed with continuous integration and continuous testing, so it has run on my development PC thousands of times. In addition, when the error occurred on a customer's PC, I copied the data back to my PC and ran the automated tests over 100 times; the error did not happen.
I would like to swap PCs or GPUs between customers who have hit this problem and those who have not, but I cannot do that. Today I replaced the GPUs with new P620s for 2 of the 8 customers, and I will keep observing for at least one week.
Does anyone have any idea about this problem?