Invalid register (AC922 P9 + V100 server)

I’ve been trying to debug a memory issue that I’ve encountered on a PPC system. So far I have not been able to reproduce it on my x86-based systems.

I have a dataflow-based implementation of matrix multiplication that uses unified memory. My test executes the operation 1000 times, with some unified memory allocation and deallocation in between.
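For context, the allocate/compute/free pattern of the test can be sketched roughly as below. This is a minimal sketch under assumptions, not the actual dataflow implementation: the naive kernel, the names CUDA_CHECK, matmul, and runIteration, and the 64x64 matrix size are all hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA runtime error.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err_));                     \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// Naive C = A * B for square n x n row-major matrices (illustrative only).
__global__ void matmul(const float* a, const float* b, float* c, int n) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n) {
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) sum += a[row * n + k] * b[k * n + col];
    c[row * n + col] = sum;
  }
}

// One iteration: allocate unified memory, compute, synchronize, free.
bool runIteration(int n) {
  size_t count = static_cast<size_t>(n) * n;
  size_t bytes = count * sizeof(float);
  float *a = nullptr, *b = nullptr, *c = nullptr;
  CUDA_CHECK(cudaMallocManaged(&a, bytes));
  CUDA_CHECK(cudaMallocManaged(&b, bytes));
  CUDA_CHECK(cudaMallocManaged(&c, bytes));

  // Initialize on the host; unified memory migrates pages as needed.
  for (size_t i = 0; i < count; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  dim3 block(16, 16);
  dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
  matmul<<<grid, block>>>(a, b, c, n);
  CUDA_CHECK(cudaGetLastError());       // catches launch errors
  CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors

  // Each output element is the sum of n products 1 * 2.
  bool ok = (c[0] == 2.0f * n);

  // The kernel has finished, so the host may free the buffers.
  CUDA_CHECK(cudaFree(a));
  CUDA_CHECK(cudaFree(b));
  CUDA_CHECK(cudaFree(c));
  return ok;
}

int main() {
  for (int iter = 0; iter < 1000; ++iter) {
    if (!runIteration(64)) {
      std::fprintf(stderr, "wrong result at iteration %d\n", iter);
      return EXIT_FAILURE;
    }
  }
  std::printf("all iterations completed\n");
  return EXIT_SUCCESS;
}
```

In this sketch every cudaFree happens only after cudaDeviceSynchronize has returned, so no kernel should still be touching the buffers when they are released.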

On the x86 system I have run numerous tests to ensure the program is stable (using cuda-memcheck to check for odd behavior and leaks), and every CUDA call has its status code checked. I have never seen any errors reported.

On the PPC system, I run the same test, and at random, after running the 1000 iterations maybe a dozen or more times, it fails with some form of error that points to ‘cudaFree’ on one of my unified memory pointers.

My latest error inspired me to post it here as I haven’t encountered anything like it before:

Here is the backtrace from cuda-gdb:
#0 0x00007ffff1d467a4 in _int_malloc () from /lib64/
#1 0x00007ffff1d4b0fc in calloc () from /lib64/
#2 0x00007ffff7201690 in cuVDPAUCtxCreate () from /lib64/
#3 0x00007ffff715d150 in cudbgApiDetach () from /lib64/
#4 0x00007ffff715ee50 in cudbgApiDetach () from /lib64/
#5 0x00007ffff7159c58 in cudbgApiDetach () from /lib64/
#6 0x00007ffff727fc18 in cuVDPAUCtxCreate () from /lib64/
#7 0x00007ffff727f8d8 in cuVDPAUCtxCreate () from /lib64/
#8 0x00007ffff7001594 in ?? () from /lib64/
#9 0x00007ffff71b6448 in cuMemFree_v2 () from /lib64/

After running ‘up’ from frame #9, I cannot go any further up the stack; instead I get the error:
Invalid register #83886081, expecting 0 <= # < 286

The signal received is SIGSEGV, Segmentation fault.

Running cuda-memcheck on this system with CUDA 10.1 reports no memory errors, even when a segfault or memory dump occurs.

Any ideas on how to proceed with debugging this kind of issue?