Error in the middle of running

While test running the bare installation, I get the following error after a number of iterations. I am also running the same case on a different computer with a different GPU, and it seems to run fine there.

[22:32:48] - [step: 8900] loss: 5.814e-02, time/iteration: 1.051e+02 ms
[22:32:59] - [step: 9000] record validators time: 6.929e-01s
[22:32:59] - [step: 9000] saved checkpoint to outputs/helmholtz
[22:32:59] - [step: 9000] loss: 3.542e-02, time/iteration: 1.155e+02 ms
[22:33:10] - [step: 9100] loss: 2.047e-02, time/iteration: 1.055e+02 ms
[22:33:20] - [step: 9200] loss: 4.343e-02, time/iteration: 1.055e+02 ms
[22:33:31] - [step: 9300] loss: 2.113e-02, time/iteration: 1.052e+02 ms
[22:33:41] - [step: 9400] loss: 2.224e-02, time/iteration: 1.054e+02 ms
[22:39:15] - loss went to Nans
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Hi @mitanshtrip

Unfortunately, it's hard to tell what the cause is here. Did you try restarting the training from the latest checkpoint?
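
If the error comes back after a restart, the hint in the message itself (CUDA_LAUNCH_BLOCKING=1) may help narrow down where it happens. Below is a minimal sketch of one way to set it from Python. The helper name `enable_blocking_launches` is made up for illustration, and the sketch assumes the variable is set before anything initializes CUDA, i.e. before `torch` is imported by the training script.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Hypothetical helper: request synchronous CUDA kernel launches.

    With CUDA_LAUNCH_BLOCKING=1, kernel errors are reported at the call
    that caused them, so the stack trace is more likely to be accurate.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this first; import torch and start the training only afterwards,
# since the variable has no effect once CUDA is already initialized.
enable_blocking_launches()
```

Synchronous launches slow training down, so this is best used only for the debugging run and removed again afterwards.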