Incidental error 700 - an illegal memory access is encountered

jay6987 · March 24, 2021, 1:08pm

Hello!

I am having trouble: my CUDA program sometimes raises error 700 (an illegal memory access was encountered), and it only happens for 8 of my roughly 70 customers.

The program runs on a Dell 7070 PC with a Quadro P620 graphics card, Windows 10 Home, and driver version 460.89. It was developed with CUDA 10.2.

The error does not happen every time or on every PC. The program has been deployed to about 70 customers for about 2 months, and only 8 of them have reported this problem (customers are required to report it whenever it occurs). For those 8 customers, the error happens about once or twice a week. Once it has happened, re-running the program with the same data fails again every time until I reboot the computer or until the screen goes black and the GPU recovers. Sometimes when the error happens, there is an entry in the Windows Event Viewer like this: Level: Error, Source: nvlddmkm, Event ID: 13. The event detail can be:

\Device\Video3
Graphics SM Warp Exception on(GPC 1, TPC 1): Out of Range Address

Or

\Device\Video3
Graphics Exception: SER 0x504648=0x137000e 0x504650=0x20 0x504644=0xd3eff2 0x50464c=0x17f

The exception can occur on (GPC0, TPC 0), (GPC0, TPC 1), (GPC1, TPC 0) or (GPC1, TPC 1).

I have tested the program with cuda-memcheck, and it reports no errors.

In my code, all memory allocations are done at the beginning when the program starts up, but the error happens in the middle of a run. I have also checked the temporary variables in the kernel code, and it does not seem possible for them to exceed the heap size limit. I can show parts of my code if necessary.

The program is developed with continuous integration and continuous testing, so it has run thousands of times on my development PC. In addition, when the error occurred on customers’ PCs, I copied the data back to my PC and ran the automated test more than 100 times, and the error did not happen.

I would like to swap PCs or GPUs between customers that have hit this problem and customers that have not, but I cannot do this. Today I replaced the P620 GPUs of 2 of the 8 customers with new ones, and I will keep observing for at least one week.

Does anyone have any idea about this problem?

Robert_Crovella · March 24, 2021, 3:20pm

When the machine is in that state, did you test it with cuda-memcheck?

I’m not sure if that means the screen is going black and the GPU is recovered, or it isn’t. But if it is, that is a solid indication of a GPU kernel duration timeout. You may be running into that condition, and it could easily be intermittent or data-dependent. That is something you will need to deal with at a design level of your code, if you want to reliably run in an environment where the WDDM TDR mechanism is active.
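
One common way to handle the timeout at the design level is to bound the amount of work submitted per kernel launch, so that no single launch can run long enough to trip the watchdog. The sketch below is host-side C++ only and is an assumption about what such a fix could look like, not something taken from the program discussed here. `run_in_chunks` and the `launch` callback are hypothetical names; the callback stands in for one short kernel launch over `count` items starting at `offset`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical driver (illustration only): processes `total` items in
// slices of at most `max_per_launch` items. `launch(offset, count)` stands
// in for one short kernel launch covering items [offset, offset + count).
// Returns the number of launches issued.
inline std::size_t run_in_chunks(
    std::size_t total, std::size_t max_per_launch,
    const std::function<void(std::size_t, std::size_t)>& launch) {
    assert(max_per_launch > 0);          // a zero chunk size is a programming error
    if (max_per_launch == 0) return 0;   // defensive path when assert is compiled out

    std::size_t launches = 0;
    std::size_t offset = 0;
    while (offset < total) {
        // The last slice may be shorter than max_per_launch.
        const std::size_t count = std::min(max_per_launch, total - offset);
        launch(offset, count);
        offset += count;                 // never exceeds total, so no wrap-around
        ++launches;
    }
    return launches;
}
```

With `total = 10` and `max_per_launch = 4`, the callback is invoked for the ranges (0, 4), (4, 4) and (8, 2). Choosing the chunk size so that each launch stays well below the timeout is workload-specific and would have to be measured.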

njuffa · March 24, 2021, 5:42pm

While less likely, there is a possibility the root cause is something that happens in host code, by computing a piece of data that when passed to a kernel or CUDA API call ultimately leads to a memory access out of bounds. Integer overflow during a size computation would be one scenario, another would be the inadvertent use of uninitialized data. Running a memory checker like valgrind on the host code might be a good idea.
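
A minimal sketch of the size-computation scenario, in host-side C++. The helper names `checked_bytes` and `wrapped_elements_32` are made up for illustration and do not come from the program under discussion. The idea is only to reject a byte count that would wrap around before it ever reaches an allocation, copy, or kernel argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (illustration only): computes count * elem_size.
// Returns false and leaves *out untouched if the product would not fit
// in std::size_t; otherwise stores the product in *out and returns true.
inline bool checked_bytes(std::size_t count, std::size_t elem_size,
                          std::size_t* out) {
    assert(out != nullptr);
    if (elem_size != 0 &&
        count > std::numeric_limits<std::size_t>::max() / elem_size) {
        return false;  // the multiplication would wrap around
    }
    *out = count * elem_size;
    return true;
}

// The kind of silent failure the check guards against: a product computed
// in 32-bit unsigned arithmetic wraps modulo 2^32 without any diagnostic.
inline std::uint32_t wrapped_elements_32(std::uint32_t width,
                                         std::uint32_t height) {
    return width * height;
}
```

For example, `wrapped_elements_32(65536, 65536)` yields 0 rather than 2^32, so a buffer sized from that value would be far too small for the indices a kernel later uses. `checked_bytes` reports the failure instead of returning a wrapped value.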

jay6987 · March 25, 2021, 1:20pm

Hi Robert,

Thank you for the reply.

I have had no chance to run cuda-memcheck in that state. Firstly, this issue never happens on my development machine. Secondly, when it happens at a customer site, I need to get the machine back to a normal state as soon as possible, but the program can run for more than 20 hours for one set of data (in a normal state, it takes only about 20 seconds).

I’m not sure either. At the time the screen went black, CUDA error 719 (unspecified launch failure) was thrown, and another application that also uses the GPU crashed at the same time. After the screen came back to normal, I re-ran my program, it worked normally, and the issue did not happen again.
My kernel functions are all very short; the longest one takes less than 10 ms in a normal state. The duration depends only on the data size, and my data size never changes.

jay6987 · March 25, 2021, 1:32pm

Thank you njuffa.

There is no run-time size computation in my code. All memory use is wrapped in classes that follow RAII, and memory is allocated in the constructor.
But I haven’t initialized all of it; I will improve that.

njuffa · March 25, 2021, 8:55pm

The first order of debugging an issue like this is to achieve in-house repro.

If all the machines involved have identically configured hardware and software, you might want to exchange one of the problematic customer machines with a known-good one, so you can then run tests on the system acquired from the customer in-house. If there are differences between your development system and the systems deployed at customers, you would want to ensure that your software passes (at least) a nightly test on a system that is identical to the systems installed at customer premises. In my thinking a nightly test is an extensive and long-running test suite that differs from the kind of light-weight “smoke” tests used in continuous integration.

Are the failing machines operating in a physically challenging environment by any chance? Examples would be: vibration (e.g. operating on a ship), electromagnetic fields (e.g. operating near large electrical machinery), high altitude (higher chance of cosmic rays affecting DRAM).

I assume that is not the case and the current working hypothesis is that the issue is due to software? Is there any software component that the failing systems have in common that is absent from the systems that are working fine? Example: A specific NVIDIA driver version.

From my work on embedded systems I remember that one design objective was to “die early”. If any of the many internal consistency checks indicated that the system was in an “impossible” state, the system would halt and preserve all relevant evidence. So basically an assertion-based method with crash report. This allowed us to drive the failure rate of the system to essentially zero within a fairly short amount of time prior to first customer deployment. Do you use such assertions in your software?
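
A minimal sketch of what such an always-on, "die early" check could look like in host-side C++. All names here (`CHECK_INVARIANT`, `fail_fast`, `format_failure`, `validate_launch_args`) are hypothetical and chosen for illustration; this is not the mechanism used in the embedded systems mentioned above. Unlike the standard `assert`, the macro is not compiled out in release builds.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Builds the one-line report that is written before the process stops.
inline std::string format_failure(const char* expr, const char* file,
                                  int line, const std::string& detail) {
    return std::string(file) + ":" + std::to_string(line) +
           ": invariant violated: " + expr + " (" + detail + ")";
}

// Writes the report to stderr and aborts, so the failing state can be
// captured (for example in a crash dump) instead of running on.
[[noreturn]] inline void fail_fast(const char* expr, const char* file,
                                   int line, const std::string& detail) {
    std::fputs(format_failure(expr, file, line, detail).c_str(), stderr);
    std::fputc('\n', stderr);
    std::fflush(stderr);
    std::abort();
}

// Always-on consistency check: evaluated in debug and release builds.
#define CHECK_INVARIANT(cond, detail)                               \
    do {                                                            \
        if (!(cond)) fail_fast(#cond, __FILE__, __LINE__, (detail)); \
    } while (0)

// Example: validate host-side arguments before they are handed to a
// (hypothetical) kernel that touches elements [offset, offset + count).
inline void validate_launch_args(std::size_t offset, std::size_t count,
                                 std::size_t capacity) {
    CHECK_INVARIANT(offset <= capacity, "offset past end of buffer");
    CHECK_INVARIANT(count <= capacity - offset, "range exceeds buffer");
}
```

Placing checks like `validate_launch_args` in front of every launch means an "impossible" offset or count stops the program on the host with a file and line number, rather than surfacing later as an illegal memory access on the device.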
