Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ: kernel problem or driver issue?

Hello,
I am trying to find the cause of an “illegal memory access was encountered” error.
cuda-memcheck comes back clean, and some executions run to completion.

But sometimes I see the following in dmesg:
“NVRM: Xid (PCI:0000:d8:00): 31, pid=252395, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x2b96_a00c0000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ”

The problem seems to happen only when I keep the graphics card quite busy, and it occurs more consistently when I run on 4 devices rather than on one.

Does this error mean that I am simply trying to read an area of memory that is not mapped for reads, i.e. my kernel is reading out of bounds, or should I rather be looking for a driver issue?

If this is not a driver problem, how can I interpret this error? Is there an instruction pointer recorded somewhere?
I am seeing this on two machines: one has 4x A30, the other 4x V100S.

Here is the dump from the V100S machine; the A30 machine has the same driver version.
GPU descriptions: Tesla PG500-216;Tesla PG500-216;Tesla PG500-216;Tesla PG500-216|
NVIDIA driver version: 460.73.01

Regards.

Xid 31 is usually closely related to your “illegal memory access” error; the “illegal memory access” report will usually be accompanied by an Xid 31 report in dmesg. The specifics vary with the type of illegal access that is happening, for example an illegal code fetch (invalid function pointer) vs. an out-of-bounds data read, but I won’t be able to decode the fault message for you.

The problem is most often the result of a coding defect in your kernel code, i.e. in the device code launched by your program. Since the problem is intermittent, and cuda-memcheck perturbs program behavior (execution order), the tool itself may be affecting whether the issue occurs.
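
For illustration only, here is a minimal CUDA sketch of the most common defect class behind this error; every name in it is hypothetical, not taken from your program. The grid is rounded up to a whole number of blocks, so the kernel needs a per-thread bounds check:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one element. The grid is rounded up, so some
// threads get i >= n; without the guard below they read past the
// end of the allocation.
__global__ void sum_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // the commonly-forgotten bounds check
    atomicAdd(out, in[i]);
}

int main()
{
    const int n = 1000;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // 4 blocks = 1024 threads for 1000 elements
    sum_kernel<<<blocks, threads>>>(in, out, n);

    // Kernel errors are reported asynchronously; synchronize to surface them.
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Note that a small overrun often stays inside the same mapped page and does not fault at the hardware level, which is one way an out-of-bounds bug can appear intermittent; the sanitizer tools flag it regardless.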

If you have a self-contained, repeatable test case, you might also wish to escalate the issue through your system supplier (or post it here, or file a bug). You might also try compute-sanitizer instead of cuda-memcheck.
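
As a usage note (the binary name here is a placeholder): running compute-sanitizer ./your_app executes the memcheck tool by default, and --tool racecheck, --tool initcheck or --tool synccheck select the other checkers. compute-sanitizer ships with the CUDA 11.x toolkits, so it should already be available on both of your machines if a recent toolkit is installed.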

OK, thanks.
In the meantime I followed the steps in XID Errors :: GPU Deployment and Management Documentation, i.e. exported CUDA_DEVICE_WAITS_ON_EXCEPTION=1, and after attaching cuda-gdb I got:

"CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x2ac77a3f2d80

Thread 1 “java” received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1178, block (0,1,0), thread (32,12,0), device 0, sm 8, warp 26, lane 0]
0x00002ac77a3f2d90 inspray_cuda_wrapper_kernel(worker_ctx_, stik_queuepacket, sik_params_ const*, sik_cache_ const*, mig_work_done_*)<<<(1,3,1),(64,16,1)>>> ()
(cuda-gdb)
Program terminated with signal CUDA_EXCEPTION_14, Warp Illegal Address.
The program no longer exists."

So I think this is progress! But I am not sure why cuda-gdb exited at this point?

A warp illegal address, caught this way, corrupts the CUDA context. No further context access/usage is possible at that point.
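
To make that concrete, here is a small host-side sketch (hypothetical kernel, purely illustrative) of what “no further context usage” looks like from the runtime API: the error is sticky, so every later call on the context reports the same failure.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that dereferences a null pointer to force
// an illegal address fault (illustration only).
__global__ void crash_kernel(int *p) { *p = 42; }

int main()
{
    crash_kernel<<<1, 1>>>(nullptr);

    // The fault is reported asynchronously; synchronize to surface it.
    cudaError_t err = cudaDeviceSynchronize();
    printf("sync:   %s\n", cudaGetErrorString(err));  // "an illegal memory access was encountered"

    // The error is sticky: the context is corrupted, so even an
    // unrelated call now fails with the same error.
    int *d = nullptr;
    err = cudaMalloc(&d, 4);
    printf("malloc: %s\n", cudaGetErrorString(err));  // still the illegal-access error

    // cudaDeviceReset() destroys the broken context; in a single-process
    // program you would typically just exit instead.
    cudaDeviceReset();
    return 0;
}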

In any event, the problem appears to be in the code you are running. If you can reproduce the issue with a debug build, that may be helpful; if not, recompiling your code with -lineinfo is a lighter-weight alternative.
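
In case the difference helps (assuming an nvcc build): -G turns off device-code optimization and embeds full debug information, so cuda-gdb can map the fault to exact source lines, at the cost of much slower kernels that can change the timing of an intermittent bug. -lineinfo merely adds source-line tables to an optimized build, which compute-sanitizer can use to report the file and line of an out-of-bounds access.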

OK, thanks. The run above was already built with -lineinfo; I will try -G.

It is better with -G. I do get the source lines now. Thanks.