Intermittent error 700 - an illegal memory access was encountered

Hello!

I am having trouble: my CUDA program sometimes raises error 700 (an illegal memory access was encountered), and it happens for only 8 of my roughly 70 customers.

The program runs on a Dell 7070 PC with a Quadro P620 graphics card, Windows 10 Home, and driver version 460.89. It was developed with CUDA 10.2.

This error does not happen every time or on every PC. The program has been deployed to about 70 customers for about 2 months, and only 8 of them have reported this problem (customers are required to report this error when it happens). For those 8 customers, the error happens about once or twice a week. Once it has happened, re-running the program with the same data keeps producing the error until I reboot the computer, or until the screen goes black and the GPU recovers.

Sometimes when the error happens, there is an entry in the Windows Event Viewer with level: Error, source: nvlddmkm, event ID: 13. The event detail can be:

\Device\Video3
Graphics SM Warp Exception on(GPC 1, TPC 1): Out of Range Address

Or

\Device\Video3
Graphics Exception: SER 0x504648=0x137000e 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f

The exception can occur on (GPC 0, TPC 0), (GPC 0, TPC 1), (GPC 1, TPC 0) or (GPC 1, TPC 1).

I have tested the program with cuda-memcheck, and it reports no errors.

In my code, all memory allocation is done at the beginning, when the program starts up, but the error happens in the middle of a run. I have also checked the temporary variables in the kernel code, and it does not seem possible for them to exceed the heap size limit. I can show parts of my code if necessary.

The program is developed with continuous integration and continuous testing, so it has run thousands of times on my development PC. In addition, whenever the error occurred on a customer's PC, I copied the data back to my PC and ran the automated test more than 100 times; the error never occurred.

I would like to swap PCs or GPUs between customers who have seen this problem and customers who have not, but I cannot do that. Today I replaced the P620 GPUs with new ones for 2 of the 8 affected customers, and I will keep observing for at least one week.

Does anyone have any idea about this problem?

When the machine is in that state, did you test it with cuda-memcheck?

I’m not sure whether you mean that the screen goes black and the GPU recovers, or that it doesn’t. But if it does, that is a solid indication of a GPU kernel duration timeout. You may be running into that condition, and it could easily be intermittent or data-dependent. That is something you will need to deal with at the design level of your code if you want to run reliably in an environment where the WDDM TDR mechanism is active.
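To illustrate the design-level point, here is a minimal host-side C++ sketch of one common way to keep each launch short: split the work into bounded chunks and issue one launch per chunk. The helper name `split_into_chunks` and the chunk size are hypothetical, and the kernel launch itself is only indicated in a comment. Whether this fits depends on how the real kernels are structured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split `total` work items into [begin, end) ranges of at
// most `max_chunk` items each, so that every range can be handled by one
// short-running kernel launch instead of a single long one.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_chunks(std::size_t total, std::size_t max_chunk) {
    assert(max_chunk > 0);  // a zero chunk size would never make progress
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t begin = 0;
    while (begin < total) {
        // The last chunk may be shorter than max_chunk.
        const std::size_t len = std::min(max_chunk, total - begin);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}

// Intended use (launch shown as a comment only; `my_kernel` is a placeholder):
//   for (const auto& r : split_into_chunks(n, chunk_size)) {
//       // my_kernel<<<grid, block>>>(data, r.first, r.second);
//   }
```

The chunk size would have to be tuned so that the slowest chunk on the slowest deployed GPU stays well below the timeout.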

While less likely, there is a possibility the root cause is something that happens in host code, by computing a piece of data that when passed to a kernel or CUDA API call ultimately leads to a memory access out of bounds. Integer overflow during a size computation would be one scenario, another would be the inadvertent use of uninitialized data. Running a memory checker like valgrind on the host code might be a good idea.
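As an illustration of the integer-overflow scenario, here is a minimal host-side C++ sketch of an overflow-checked size computation. The helper names `checked_mul` and `checked_image_bytes` are hypothetical and not taken from the program under discussion; the idea is only that a product such as width * height * bytes-per-pixel is validated before it is used as an allocation size, copy size, or kernel argument.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: store a * b in *out and return true, or return false
// if the product does not fit in std::size_t.
bool checked_mul(std::size_t a, std::size_t b, std::size_t* out) {
    assert(out != nullptr);
    if (b != 0 && a > std::numeric_limits<std::size_t>::max() / b) {
        return false;  // product would wrap around
    }
    *out = a * b;
    return true;
}

// Hypothetical helper: byte size of a width x height image. The int inputs are
// widened before multiplying, because `width * height` evaluated in int
// overflows (undefined behavior) for large dimensions.
bool checked_image_bytes(int width, int height, std::size_t bytes_per_pixel,
                         std::size_t* out) {
    if (width < 0 || height < 0) {
        return false;  // negative dimensions are never valid
    }
    std::size_t pixels = 0;
    return checked_mul(static_cast<std::size_t>(width),
                       static_cast<std::size_t>(height), &pixels) &&
           checked_mul(pixels, bytes_per_pixel, out);
}
```

A caller would treat a `false` return as a hard error rather than continuing with a partial or wrapped size.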

Hi Robert,

Thank you for your reply.

I have had no chance to run cuda-memcheck in that state. Firstly, this issue never happens on my development machine. Secondly, when it happens at a customer site, I need to get the machine back to a normal state as soon as possible, but under cuda-memcheck the program can take more than 20 hours for one data set (normally it takes only about 20 seconds).

I’m not sure either. At the time the screen went black, CUDA error 719 (unspecified launch failure) was thrown, and another application that also uses the GPU crashed at the same time. After the screen came back, I re-ran my program; it was back to normal, and the issue did not happen again.
My kernel functions are all very short; the longest one takes less than 10 ms in a normal state. The duration depends only on the data size, and my data size never changes.

Thank you njuffa.

There is no run-time size computation in my code. All memory use is wrapped in classes that follow RAII; memory is allocated in the constructor.
But I have not initialized all of it. I will improve that.

The first order of business in debugging an issue like this is to achieve an in-house repro.

If all the machines involved have identically configured hardware and software, you might want to exchange one of the problematic customer machines with a known-good one, so you can then run tests on the system acquired from the customer in-house. If there are differences between your development system and the systems deployed at customers, you would want to ensure that your software passes (at least) a nightly test on a system that is identical to the systems installed at customer premises. In my thinking a nightly test is an extensive and long-running test suite that differs from the kind of light-weight “smoke” tests used in continuous integration.

Are the failing machines operating in a physically challenging environment by any chance? Examples would be: vibration (e.g. operating on a ship), electromagnetic fields (e.g. operating near large electrical machinery), high altitude (higher chance of cosmic rays affecting DRAM).

I assume that is not the case and the current working hypothesis is that the issue is due to software? Is there any software component that the failing systems have in common that is absent from the systems that are working fine? Example: A specific NVIDIA driver version.

From my work on embedded systems I remember that one design objective was to “die early”. If any of the many internal consistency checks indicated that the system was in an “impossible” state, the system would halt and preserve all relevant evidence. So basically an assertion-based method with crash report. This allowed us to drive the failure rate of the system to essentially zero within a fairly short amount of time prior to first customer deployment. Do you use such assertions in your software?
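For illustration, a minimal host-side C++ sketch of such a "die early" check could look like the following. The names `invariant_report`, `DIE_UNLESS`, and `check_range` are hypothetical, and a real crash report would likely preserve far more state than one line on stderr. Unlike the standard `assert`, this macro stays active when `NDEBUG` is defined, so it also fires in release builds at customer sites.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: build the one-line report written before halting.
inline std::string invariant_report(const char* expr, const char* file, int line) {
    return std::string("invariant violated: ") + expr + " at " + file + ":" +
           std::to_string(line);
}

// Hypothetical macro: if the condition is false, write a report to stderr and
// halt immediately, so the failure is caught at its origin.
#define DIE_UNLESS(cond)                                                      \
    do {                                                                      \
        if (!(cond)) {                                                        \
            std::fputs(invariant_report(#cond, __FILE__, __LINE__).c_str(),   \
                       stderr);                                               \
            std::fputc('\n', stderr);                                         \
            std::abort();                                                     \
        }                                                                     \
    } while (0)

// Example use: validate an offset/length pair against a buffer capacity before
// the values are handed to a kernel or a CUDA API call.
inline void check_range(std::size_t offset, std::size_t length,
                        std::size_t capacity) {
    DIE_UNLESS(offset <= capacity);
    DIE_UNLESS(length <= capacity - offset);  // written this way to avoid overflow
}
```

Placing checks like `check_range` directly in front of every kernel launch and memory copy would turn an out-of-range argument into an immediate, located failure instead of a later illegal memory access.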