Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ kernel problem or driver issue?

jacek.tomaka · June 9, 2021, 3:08pm

Hello,
I am trying to find the cause of an “illegal memory access was encountered” error.
cuda-memcheck comes back clean, and some executions run to completion.

But sometimes I see the following in dmesg:
“NVRM: Xid (PCI:0000:d8:00): 31, pid=252395, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x2b96_a00c0000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ”

The problem seems to happen only when I keep the graphics card very busy, and it happens more consistently when I run on 4 devices rather than one.

Does this error mean that I am just trying to access an area of memory that is not mapped for read, i.e. my kernel is reading out of bounds, or should I rather look for a driver issue?

If this is not a driver problem, how can I understand this error? Is there an instruction pointer somewhere?
I am seeing this on two machines: one has 4 A30s, the other has 4 V100S.

Here is the dump from the V100S machine; the A30 machine has the same driver version.
GPU descriptions: Tesla PG500-216;Tesla PG500-216;Tesla PG500-216;Tesla PG500-216
NVIDIA driver version: 460.73.01

Regards.

Robert_Crovella · June 9, 2021, 3:22pm

An Xid 31 is usually closely related to the “illegal memory access” error. The “illegal memory access” report will usually be accompanied by an Xid 31 report in dmesg. The specifics will vary based on the type of illegal memory access that is happening, for example an illegal code fetch (invalid function pointer) vs. an out-of-bounds memory read, but I won’t be able to decode it for you.
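For anyone who wants to collect these reports from dmesg and compare them across runs, here is a minimal Python sketch that splits an Xid MMU fault line into its fields. It does not decode what the fault means. The regular expression is written against the single line quoted in this thread, so the exact layout on other driver versions is an assumption, as is treating the underscore in the address as a plain separator. The names `XID_RE` and `parse_xid_line` are made up for this example.

```python
import re

# Sample line copied from the dmesg output quoted in this thread.
XID_LINE = (
    "NVRM: Xid (PCI:0000:d8:00): 31, pid=252395, Ch 00000010, intr 00000000. "
    "MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x2b96_a00c0000. "
    "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ"
)

# Pattern written against that one sample (assumption: other drivers may
# format the line differently, so this is not a specification).
XID_RE = re.compile(
    r"Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
    r".*?ENGINE (?P<engine>\w+) (?P<client>\w+) faulted @ (?P<addr>0x[0-9a-fA-F_]+)"
    r".*?Fault is of type (?P<fault>\w+) (?P<access>\w+)"
)


def parse_xid_line(line):
    """Return the fields of an Xid MMU fault line as a dict, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["xid"] = int(fields["xid"])
    fields["pid"] = int(fields["pid"])
    # Assumption: the "_" only separates the upper and lower address halves.
    fields["addr"] = int(fields["addr"].replace("_", ""), 16)
    return fields


if __name__ == "__main__":
    for key, value in parse_xid_line(XID_LINE).items():
        print(key, "=", hex(value) if key == "addr" else value)
```

Grouping the parsed records by `pid`, `fault` and `access` makes it easier to see whether every failing run reports the same kind of fault.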

The problem is most often the result of a coding defect, either in your kernel code or in kernel code launched by your program. The problem appears to be intermittent, and since cuda-memcheck perturbs program behavior (execution order), running under it may change whether the issue occurs.

If you have a self-contained, repeatable test case, you might also wish to elevate the issue through your system supplier (or post it here, or file a bug). You might also try using compute-sanitizer instead of cuda-memcheck.

jacek.tomaka · June 9, 2021, 3:27pm

Ok, thanks.
In the meantime I was following the steps in “XID Errors :: GPU Deployment and Management Documentation”, i.e. I exported CUDA_DEVICE_WAITS_ON_EXCEPTION=1, and now after attaching cuda-gdb I got:

"CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x2ac77a3f2d80

Thread 1 “java” received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1178, block (0,1,0), thread (32,12,0), device 0, sm 8, warp 26, lane 0]
0x00002ac77a3f2d90 inspray_cuda_wrapper_kernel(worker_ctx_, stik_queuepacket, sik_params_ const*, sik_cache_ const*, mig_work_done_*)<<<(1,3,1),(64,16,1)>>> ()
(cuda-gdb)
Program terminated with signal CUDA_EXCEPTION_14, Warp Illegal Address.
The program no longer exists."

So I think this is progress! But I am not sure why cuda-gdb exited at this point.
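The focus line in the cuda-gdb output above already pins down which thread faulted, even without source line information. As a rough sketch, the Python below turns the reported block and thread indices into global coordinates using the conventional `blockIdx * blockDim + threadIdx` formula. Whether the kernel in question actually indexes its data this way is an assumption, since its source is not shown here. The function names are hypothetical.

```python
# Values copied from the cuda-gdb focus line and launch configuration above.
BLOCK_DIM = (64, 16, 1)   # second launch parameter: <<<(1,3,1),(64,16,1)>>>
BLOCK_IDX = (0, 1, 0)     # "block (0,1,0)"
THREAD_IDX = (32, 12, 0)  # "thread (32,12,0)"
WARP_SIZE = 32


def global_coords(block_idx, block_dim, thread_idx):
    """Per-axis blockIdx * blockDim + threadIdx (assumed indexing scheme)."""
    return tuple(b * d + t for b, d, t in zip(block_idx, block_dim, thread_idx))


def linear_thread_in_block(block_dim, thread_idx):
    """Flatten threadIdx with x varying fastest, as CUDA orders threads in a block."""
    x, y, z = thread_idx
    dx, dy, _ = block_dim
    return x + y * dx + z * dx * dy


if __name__ == "__main__":
    coords = global_coords(BLOCK_IDX, BLOCK_DIM, THREAD_IDX)
    tid = linear_thread_in_block(BLOCK_DIM, THREAD_IDX)
    print("global coords:", coords)    # (32, 28, 0)
    print("lane:", tid % WARP_SIZE)    # 0, matching "lane 0" in the focus line
```

If the kernel uses these coordinates to index a buffer, comparing them with the buffer dimensions is a quick first check for an out-of-bounds read.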

Robert_Crovella · June 9, 2021, 3:46pm

A warp illegal address, caught this way, corrupts the CUDA context. No further context access/usage is possible at that point.

In any event the problem seems to be in the code you are running. If you can reproduce the issue with a debug build, that may be helpful. If not, recompiling your code with -lineinfo may be helpful.

jacek.tomaka · June 9, 2021, 3:50pm

Ok thanks. This is with -lineinfo. I will try with -G.

jacek.tomaka · June 9, 2021, 4:14pm

It is better with -G. I do get the lines. Thanks.
