Xid 31 is usually closely associated with the “illegal memory access” error you are seeing. An illegal memory access reported by the CUDA runtime will usually be accompanied by an Xid 31 report in dmesg. The specifics of the Xid report vary with the type of illegal access that is happening, for example an illegal code fetch (invalid function pointer) vs. an out-of-bounds memory read, but I won’t be able to decode it for you.
The problem is most often the result of a coding defect, either in your kernel code or in kernel code that is launched by your program. It seems that the problem is intermittent, and since cuda-memcheck perturbs program behavior (execution order), running under it may be affecting whether the issue occurs.
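As a rough illustration of the kind of defensive coding that helps localize this class of defect, here is a minimal sketch. It is not taken from your code: the kernel `scale`, the macro `CUDA_CHECK`, and the sizes are all hypothetical. It shows a bounds guard in the kernel plus error checks immediately after the launch, so an illegal access is reported close to the kernel that caused it rather than at some later, unrelated API call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): abort with file/line
// information on any CUDA runtime error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel: the grid is rounded up to a whole number of blocks,
// so without the `i < n` guard the trailing threads would write out of bounds.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of the block size
    float *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, n, 2.0f);

    // Launch-configuration errors show up here.
    CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel runs (such as an illegal memory
    // access) are reported by the next synchronizing call.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_data));
    std::printf("done\n");
    return 0;
}
```

Synchronizing after every launch costs performance, so treat it as a debugging aid rather than something to leave enabled. Building with `nvcc -lineinfo` should also let the memory-checking tools attribute a bad access to a source line.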
If you have a self-contained, repeatable test case, you might also wish to escalate the issue through your system supplier (or post it here, or file a bug). You might also try using compute-sanitizer instead of cuda-memcheck.