Post-mortem debugging of a compute kernel crash

Hello,
I am encountering errors such as “an illegal memory access was encountered”. With the help of CUDA_DEVICE_WAITS_ON_EXCEPTION=1, as described in (XID Errors :: GPU Deployment and Management Documentation), and cuda-gdb, I am able to get more context on the issue.

I have two questions:

  1. Is there a way to programmatically get more information about the actual error context from within the crashing process, for example:
    "CUDA Exception: Warp Illegal Address
    The exception was triggered at PC 0x2b8d78a555e0 (sikJNI.cu:5205)
    0x00002b8d78a555f0 in _INTERNAL_53_tmpxft_00006ce0_00000000_10_sikJNI_compute_75_cpp1_ii_39a64ee6::xy_single (result=0x2b8fb6de21c0, ilcl=…, iz=0, h=0x2b935db49538, fan=2, c=0x2b8ec44000e8, closest_fan=0)
    at src/sikJNI.cu:5205
    "
  2. Is there a way to save a crash dump in a form that cuda-gdb can load for post-mortem debugging?

Regards.
Jacek Tomaka

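CUDA has a GPU core dump facility. Setting the environment variable CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 before starting the application makes the driver write a GPU core dump when a device exception occurs, and cuda-gdb can load that file for post-mortem debugging. You can read about it in the GPU core dump section of the cuda-gdb documentation.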
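For reference, here is a minimal sketch of how the core dump workflow might look from a Linux shell. The binary name `./my_app` and the dump location under `/tmp` are hypothetical placeholders, not taken from the original post. The default dump file name differs between toolkit versions, so the sketch sets an explicit name through CUDA_COREDUMP_FILE. Check the cuda-gdb documentation for your version before relying on the details.

```shell
# Enable GPU core dumps for every CUDA process started from this shell.
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1

# Optional: pick the dump file name. In the template, %h expands to the
# host name and %p to the process ID. The /tmp location is an assumption.
export CUDA_COREDUMP_FILE="/tmp/cuda_core.%h.%p"

# Run the application as usual (./my_app is a hypothetical binary name):
#   ./my_app
#
# After the crash, open the dump in cuda-gdb for post-mortem inspection:
#   cuda-gdb ./my_app
#   (cuda-gdb) target cudacore /tmp/cuda_core.<host>.<pid>
#   (cuda-gdb) bt
```

Building the application with device debug information or line information (for example `-lineinfo`) should make it possible for the dump to be mapped back to source lines such as the `sikJNI.cu:5205` location shown in the question.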

Great! Thanks!