Full memory dump of a running neural network training process

I wanted to take a full core dump of a ResNet training process running on MXNet. I tried setting the environment variables mentioned in CUDA-GDB :: CUDA Toolkit Documentation, but it doesn't seem to be dumping any data.

I'm running this on a Titan V with CUDA 9.0. The MXNet code is built from source with debugging enabled.

Hi, anandj91

Please try the commands below:

  1. ulimit -c unlimited
  2. CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_COREDUMP_FILE=foobar1 ./$your_app $para1 $para2
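For an MXNet training run launched from a Python script, the invocation might look like the following sketch (the script name and arguments here are only placeholders for your own training command):

  ulimit -c unlimited
  CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_COREDUMP_FILE=foobar1 python train_resnet.py --gpus 0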

Then, if an exception is captured, the dump data will be generated. Note that a run which finishes without hitting a GPU exception will not produce a dump.

The GPU dump is written as ‘foobar1’ and the CPU dump as ‘core’ by default.
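Once a GPU dump has been produced, it can be opened in cuda-gdb. A rough sketch, assuming the dump file is named foobar1 as above (the exact commands may differ by toolkit version, so check the CUDA-GDB documentation for your release):

  cuda-gdb
  (cuda-gdb) target cudacore foobar1
  (cuda-gdb) info cuda kernels
  (cuda-gdb) backtrace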

Also, the latest CUDA Toolkit version is now 9.2; you can get it from the official site.
Thanks!