Unknown reason for stopping model training

Please provide the following information when requesting support.

• Hardware (GTX 1080Ti)
• Network Type (Detectnet_v2)

I work via ssh. The problem is that when I start the training process

tao detectnet_v2 train -e $SPECS_DIR/trafficcamnet_finetune.txt -r $USER_EXPERIMENT_DIR/tcn_d1_finetune3  -k $KEY -n resnet18_detector  --gpus $NUM_GPUS

in the Jupyter notebook, it runs fine. But as soon as I close my laptop and go to sleep, when I open it the next morning I see that training did not run for all epochs. I also do not see any message about early stopping.
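
If the Jupyter server only lives inside the SSH session, suspending the laptop drops the connection and the server, along with the cell that launched tao, is killed with it. A common workaround is to start the server in a detachable tmux session; a minimal sketch, where the session name and port are arbitrary:

tmux new -s jupyter
jupyter notebook --no-browser --port 8888
# detach with Ctrl-b d, reattach later with: tmux attach -t jupyter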

I tried to look at some logs:
status.json

{"date": "3/3/2022", "time": "16:16:21", "status": "Starting DetectNet_v2 Training job"}
{"date": "3/3/2022", "time": "16:16:21", "status": "Training gridbox model."}
{"date": "3/3/2022", "time": "16:16:32", "status": "Building DetectNet V2 model"}
{"date": "3/3/2022", "time": "16:16:42", "status": "DetectNet V2 model built."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Building rasterizer."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Rasterizers built."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Building training graph."}
{"date": "3/3/2022", "time": "16:16:43", "status": "Rasterizing tensors."}
{"date": "3/3/2022", "time": "16:16:44", "status": "Tensors rasterized."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Training graph built."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Building validation graph."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Rasterizing tensors."}
{"date": "3/3/2022", "time": "16:16:47", "status": "Tensors rasterized."}
{"date": "3/3/2022", "time": "16:16:47", "status": "Validation graph built."}
{"date": "3/3/2022", "time": "16:16:48", "status": "Running training loop."}
{"loss": 0.06850293278694153, "cur_epoch": 0, "max_epoch": 10, "time_per_epoch": "0:00:00", "ETA": "0:00:00", "learning_rate": 4.999999418942025e-06, "date": "3/3/2022", "time": "16:17:53", "status": "Running."}
{"validation cost": 6.021e-05, "mean average precision": 16.781, "average_precision": {"car": 46.4476, "bus": 24.8801, "long_vehicle": 18.565, "van": 20.3935, "truck": 2.7397, "2wheeler": 2.7919, "pedestrian": 1.6489}, "date": "3/3/2022", "time": "18:30:28", "status": "Evaluation Complete"}
{"loss": 7.775373524054885e-05, "cur_epoch": 1, "max_epoch": 10, "time_per_epoch": "2:13:11.410628", "ETA": "19:58:42.695653", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "18:30:28", "status": "Running."}
{"validation cost": 5.357e-05, "mean average precision": 14.7426, "average_precision": {"car": 58.2489, "bus": 25.7203, "long_vehicle": 18.2387, "van": 0.7273, "truck": 0.206, "2wheeler": 0.0, "pedestrian": 0.057}, "date": "3/3/2022", "time": "20:42:15", "status": "Evaluation Complete"}
{"loss": 3.412932710489258e-05, "cur_epoch": 2, "max_epoch": 10, "time_per_epoch": "2:11:47.082929", "ETA": "17:34:16.663431", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "20:42:15", "status": "Running."}
{"validation cost": 5.206e-05, "mean average precision": 17.4663, "average_precision": {"car": 61.5912, "bus": 33.5529, "long_vehicle": 21.8146, "van": 2.1178, "truck": 1.0009, "2wheeler": 0.5706, "pedestrian": 1.6161}, "date": "3/3/2022", "time": "22:50:0", "status": "Evaluation Complete"}
{"loss": 2.3556069209007546e-05, "cur_epoch": 3, "max_epoch": 10, "time_per_epoch": "2:07:45.075439", "ETA": "14:54:15.528071", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "22:50:0", "status": "Running."}
{"validation cost": 5.66e-05, "mean average precision": 14.9771, "average_precision": {"car": 57.7176, "bus": 29.3261, "long_vehicle": 16.6422, "van": 0.1008, "truck": 0.3109, "2wheeler": 0.029, "pedestrian": 0.7132}, "date": "3/4/2022", "time": "0:55:2", "status": "Evaluation Complete"}
{"loss": 3.6846518923994154e-05, "cur_epoch": 4, "max_epoch": 10, "time_per_epoch": "2:05:01.840346", "ETA": "12:30:11.042078", "learning_rate": 0.0004999999655410647, "date": "3/4/2022", "time": "0:55:2", "status": "Running."}

Checking docker containers

docker ps
CONTAINER ID        IMAGE                                                     COMMAND                  CREATED             STATUS              PORTS                               NAMES
a98a7ae338e0        nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3   "install_ngc_cli.sh …"   17 hours ago        Up 17 hours                                             hungry_shirley
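
The container from that run is still up, so its logs may show where the job stopped. A sketch using the container ID from the docker ps output above:

docker logs --tail 100 a98a7ae338e0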

Checking active processes

ps -aux | grep ipy
name      164904  0.0  0.0  11864   648 pts/3    S+   11:05   0:00 grep --color=auto ipy

ps -aux | grep jupyter
name      164970  0.0  0.0  11868   716 pts/3    S+   11:05   0:00 grep --color=auto jupyter
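
Since no jupyter or ipython process is left on the host, it can also be checked whether anything is still running inside the container and whether the GPU is still busy. A sketch, again using the container ID from docker ps (assuming ps is available in the image):

docker exec a98a7ae338e0 ps aux | grep -i train
nvidia-smi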

Could you try to run the training inside the docker container and check if the same thing happens?
$ tao detectnet_v2
# detectnet_v2 train xxx
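
For reference, a filled-in version of that command would reuse the same arguments as the launcher call above. This is only a sketch: the spec and results paths seen inside the container depend on how the host directories are mounted, and the environment variables are assumed to be defined in that shell.

# detectnet_v2 train -e $SPECS_DIR/trafficcamnet_finetune.txt -r $USER_EXPERIMENT_DIR/tcn_d1_finetune3 -k $KEY -n resnet18_detector --gpus $NUM_GPUS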

I guess so, but for now I have converted my Jupyter notebook to an .ipy script and I am running it like this:

ipython home/traffic_cam_net.ipy > /home/traffic_cam_net_output.txt
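
Even as a script, the ipython process is still a child of the SSH shell and gets killed when the session drops. A detached launch is one way around that; a minimal sketch with nohup, keeping the same paths as above:

nohup ipython home/traffic_cam_net.ipy > /home/traffic_cam_net_output.txt 2>&1 &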

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

OK, please monitor whether the training completes.
