I am trying to find the cause of an “illegal memory access was encountered” error.
cuda-memcheck comes back clean, and some executions run to completion.
But sometimes I see the following in dmesg:
“NVRM: Xid (PCI:0000:d8:00): 31, pid=252395, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x2b96_a00c0000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ”
The problem seems to happen only when I keep the GPUs very busy, and it happens more consistently when I run on 4 devices rather than one.
Does this error mean that I am simply accessing memory that is not mapped for reading, i.e. my kernel is reading out of bounds, or should I rather look for a driver issue?
If this is not a driver problem, how can I interpret this error? Is there an instruction pointer somewhere?
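For comparing faults across runs and devices, the dmesg line can be split into named fields. This is a minimal sketch, assuming every such message follows the exact layout of the single line quoted above. The pattern is inferred from that one line, not from any NVIDIA specification, and `parse_xid_mmu_fault` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one Xid 31 line quoted in this post (assumption:
# other MMU fault messages share this layout; other Xid types will not match).
XID_MMU_FAULT = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), Ch (?P<channel>[0-9a-fA-F]+), "
    r"intr (?P<intr>[0-9a-fA-F]+)\. MMU Fault: ENGINE (?P<engine>\S+) "
    r"(?P<client>\S+) faulted @ 0x(?P<addr>[0-9a-fA-F_]+)\. "
    r"Fault is of type (?P<fault_type>[A-Z0-9_]+) (?P<access_type>[A-Z0-9_]+)"
)


def parse_xid_mmu_fault(line):
    """Return a dict of fields from an Xid MMU fault line, or None if no match."""
    match = XID_MMU_FAULT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["xid"] = int(fields["xid"])
    fields["pid"] = int(fields["pid"])
    # The address is printed with an underscore separator, e.g. 0x2b96_a00c0000.
    fields["addr"] = int(fields["addr"].replace("_", ""), 16)
    return fields


if __name__ == "__main__":
    sample = (
        "NVRM: Xid (PCI:0000:d8:00): 31, pid=252395, Ch 00000010, "
        "intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted "
        "@ 0x2b96_a00c0000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ"
    )
    fault = parse_xid_mmu_fault(sample)
    print(fault["pci"], fault["pid"], hex(fault["addr"]), fault["fault_type"])
```

Collecting the `pci`, `pid`, and `addr` fields over many failing runs would show whether the fault always comes from the same process and address range or moves between devices.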
I am seeing this on two machines: one has 4 A30s, the other has 4 V100S cards.
Here is the dump from the V100S machine; the A30 machine has the same driver version.
GPU descriptions: Tesla PG500-216; Tesla PG500-216; Tesla PG500-216; Tesla PG500-216
NVIDIA driver version: 46073.01