I am running the Modulus v22.03.1 docker container on a Mac. It used to work fine, but now several of the examples start up, print the training parameters, and then get killed before training can start. The output looks like this:
[00:55:36] - Jit compiling network arch
[00:55:37] - attempting to restore from: outputs/wave_1d
[00:55:37] - optimizer checkpoint not found
[00:55:37] - model wave_network.pth not found
Do you know why the training process would be getting “Killed” like this?
Hi @gemma.mason,
We don’t have a “Killed” exit message inside of Modulus, so I suspect this is coming from the PyTorch side. Some suggestions: turn off JIT (add jit: false to your config.yaml) and try lowering the batch size for these examples. It’s hard to tell, but you could be running out of memory.
If these do not work, do you have a list of examples that work vs. ones that don’t?
You are correct, it was a memory issue! I found this page, which discusses the same problem in the PyTorch context: Code stopping with text "Killed"? - PyTorch Forums
Thank you very much for the advice.
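For anyone who lands here with the same symptom: a bare "Killed" usually means the process received SIGKILL, which is what happens when the operating system runs out of memory. One rough way to check is to log the peak memory of the Python process at a few points during startup. The sketch below uses only the standard library; the helper name peak_rss_mib is my own and is not part of Modulus or PyTorch, and where you call it in a training script is up to you.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux (which is what runs
    # inside a docker container) and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Hypothetical usage: call this before and after the expensive
    # setup steps to see how close the process gets to the memory limit.
    print(f"peak memory so far: {peak_rss_mib():.1f} MiB")
```

If the reported peak climbs toward the memory available to the container just before the process dies, that points to an out-of-memory kill rather than a bug in the example.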