Unknown Reason for stopping detectnet_v2 training

• Hardware (V100)
• Network Type (Detectnet_v2)
• TLT Version (3.22.20)
• Training spec file
detectnet_v2_train_resnet18_voc.txt (7.2 KB)

I downloaded the pretrained ResNet18 detectnet_v2 model from NGC and want to retrain it on the Pascal VOC dataset. The labels have been correctly converted from XML to KITTI format and the TFRecords were generated. All steps are run in a Jupyter notebook connected to the server over SSH, using the cv_samples_v1.3.0/detectnet_v2 notebook as a reference. The training starts successfully, but if I close the notebook or the connection times out, the training stops:
The nvidia-smi command shows “No running processes”.
The tao list command shows the container in the “running” state, but with the command “Not in support of DNN task”.
The status.json file contains:
{"loss": 7.042660581646487e-05, "cur_epoch": 70, "max_epoch": 120, "time_per_epoch": "0:00:00", "ETA": "0:00:00", "learning_rate": 0.0004999999655410647, "date": "5/17/2022", "time": "14:47:37", "status": "Running."}
{"Error": "", "date": "5/17/2022", "time": "14:49:45", "status": "Training was interrupted"}
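
To check from a script whether a run is still alive, the last entry of status.json can be read programmatically. The sketch below is only an illustration: it assumes the file holds one JSON object per line, as in the excerpt above, and the helper name last_status is hypothetical, not part of TAO.

```python
import json


def last_status(text):
    """Return the "status" field of the last non-empty JSON line, or None."""
    last = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        last = json.loads(line)  # assumption: one JSON object per line
    return last.get("status") if last else None


# Two shortened entries modelled on the excerpt above.
sample = (
    '{"cur_epoch": 70, "max_epoch": 120, "status": "Running."}\n'
    '{"Error": "", "status": "Training was interrupted"}\n'
)
print(last_status(sample))
```

On a real run, the text would come from reading the status.json file in the results directory.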

I reproduced the same error using the cv_samples_v1.3.0/detectnet_v2 notebook with the KITTI dataset (I did not notice at first because that training finishes quickly).
How can I fix this problem? What if I have a 30-hour training run? Thanks!

Do you mean you quit the training just after it started?

As soon as I close the notebook, the training stops, whichever epoch it has reached. What should I do to keep the training running in the container even after I close the Jupyter notebook?

You can run the training in a terminal instead of the Jupyter notebook.
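
If the terminal is itself an SSH session, the job may still end when that session closes, so it helps to start it detached (tmux, screen, or nohup are the usual tools). As a rough sketch of the same idea in Python, the helper below starts a command in its own session and sends its output to a log file. The name launch_detached and the paths in the commented example are placeholders, and whether the tao launcher behaves well without an attached terminal is not confirmed in this thread.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in its own session, appending stdout and stderr to log_path."""
    log = open(log_path, "ab")
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,   # do not depend on the caller's terminal
        stdout=log,
        stderr=subprocess.STDOUT,   # merge stderr into the same log
        start_new_session=True,     # POSIX setsid(): detach from the caller's session
    )
    log.close()  # the child keeps its own copy of the file descriptor
    return proc


# Hypothetical usage; spec path, results directory, and key are placeholders:
# launch_detached(
#     ["tao", "detectnet_v2", "train",
#      "-e", "/path/to/spec.txt", "-r", "/path/to/results", "-k", "YOUR_KEY"],
#     "train.log",
# )
```

Progress can then be followed by reading the log file from any later session.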

It worked, Thanks!

