Unknown reason for stopping model training

Please provide the following information when requesting support.

• Hardware (GTX 1080Ti)
• Network Type (Detectnet_v2)

I work via ssh. The problem is that when I start the training process

tao detectnet_v2 train -e $SPECS_DIR/trafficcamnet_finetune.txt -r $USER_EXPERIMENT_DIR/tcn_d1_finetune3  -k $KEY -n resnet18_detector  --gpus $NUM_GPUS

in the Jupyter notebook, it runs fine. But as soon as I close my laptop and go to sleep, when I open it the next morning I see that training did not run for all epochs. I also do not see any message about early stopping.
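
If the Jupyter server only lives inside the SSH session, suspending the laptop drops the connection and the server, along with the cell that launched tao, is killed with it. A common workaround is to start the server in a detachable tmux session; a minimal sketch, where the session name and port are arbitrary:

tmux new -s jupyter
jupyter notebook --no-browser --port 8888
# detach with Ctrl-b d, reattach later with: tmux attach -t jupyter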

I tried to look at some logs:
status.json

{"date": "3/3/2022", "time": "16:16:21", "status": "Starting DetectNet_v2 Training job"}
{"date": "3/3/2022", "time": "16:16:21", "status": "Training gridbox model."}
{"date": "3/3/2022", "time": "16:16:32", "status": "Building DetectNet V2 model"}
{"date": "3/3/2022", "time": "16:16:42", "status": "DetectNet V2 model built."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Building rasterizer."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Rasterizers built."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Building training graph."}
{"date": "3/3/2022", "time": "16:16:43", "status": "Rasterizing tensors."}
{"date": "3/3/2022", "time": "16:16:44", "status": "Tensors rasterized."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Training graph built."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Building validation graph."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Rasterizing tensors."}
{"date": "3/3/2022", "time": "16:16:47", "status": "Tensors rasterized."}
{"date": "3/3/2022", "time": "16:16:47", "status": "Validation graph built."}
{"date": "3/3/2022", "time": "16:16:48", "status": "Running training loop."}
{"loss": 0.06850293278694153, "cur_epoch": 0, "max_epoch": 10, "time_per_epoch": "0:00:00", "ETA": "0:00:00", "learning_rate": 4.999999418942025e-06, "date": "3/3/2022", "time": "16:17:53", "status": "Running."}
{"validation cost": 6.021e-05, "mean average precision": 16.781, "average_precision": {"car": 46.4476, "bus": 24.8801, "long_vehicle": 18.565, "van": 20.3935, "truck": 2.7397, "2wheeler": 2.7919, "pedestrian": 1.6489}, "date": "3/3/2022", "time": "18:30:28", "status": "Evaluation Complete"}
{"loss": 7.775373524054885e-05, "cur_epoch": 1, "max_epoch": 10, "time_per_epoch": "2:13:11.410628", "ETA": "19:58:42.695653", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "18:30:28", "status": "Running."}
{"validation cost": 5.357e-05, "mean average precision": 14.7426, "average_precision": {"car": 58.2489, "bus": 25.7203, "long_vehicle": 18.2387, "van": 0.7273, "truck": 0.206, "2wheeler": 0.0, "pedestrian": 0.057}, "date": "3/3/2022", "time": "20:42:15", "status": "Evaluation Complete"}
{"loss": 3.412932710489258e-05, "cur_epoch": 2, "max_epoch": 10, "time_per_epoch": "2:11:47.082929", "ETA": "17:34:16.663431", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "20:42:15", "status": "Running."}
{"validation cost": 5.206e-05, "mean average precision": 17.4663, "average_precision": {"car": 61.5912, "bus": 33.5529, "long_vehicle": 21.8146, "van": 2.1178, "truck": 1.0009, "2wheeler": 0.5706, "pedestrian": 1.6161}, "date": "3/3/2022", "time": "22:50:0", "status": "Evaluation Complete"}
{"loss": 2.3556069209007546e-05, "cur_epoch": 3, "max_epoch": 10, "time_per_epoch": "2:07:45.075439", "ETA": "14:54:15.528071", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "22:50:0", "status": "Running."}
{"validation cost": 5.66e-05, "mean average precision": 14.9771, "average_precision": {"car": 57.7176, "bus": 29.3261, "long_vehicle": 16.6422, "van": 0.1008, "truck": 0.3109, "2wheeler": 0.029, "pedestrian": 0.7132}, "date": "3/4/2022", "time": "0:55:2", "status": "Evaluation Complete"}
{"loss": 3.6846518923994154e-05, "cur_epoch": 4, "max_epoch": 10, "time_per_epoch": "2:05:01.840346", "ETA": "12:30:11.042078", "learning_rate": 0.0004999999655410647, "date": "3/4/2022", "time": "0:55:2", "status": "Running."}

Checking docker containers

docker ps
CONTAINER ID        IMAGE                                                     COMMAND                  CREATED             STATUS              PORTS                               NAMES
a98a7ae338e0        nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3   "install_ngc_cli.sh …"   17 hours ago        Up 17 hours                                             hungry_shirley
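
The container from that run is still up, so its logs may show where the job stopped. A sketch using the container ID from the docker ps output above:

docker logs --tail 100 a98a7ae338e0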

Checking active processes

ps -aux | grep ipy
name      164904  0.0  0.0  11864   648 pts/3    S+   11:05   0:00 grep --color=auto ipy

ps -aux | grep jupyter
name      164970  0.0  0.0  11868   716 pts/3    S+   11:05   0:00 grep --color=auto jupyter
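
Since no jupyter or ipython process is left on the host, it can also be checked whether anything is still running inside the container and whether the GPU is still busy. A sketch, again using the container ID from docker ps (assuming ps is available in the image):

docker exec a98a7ae338e0 ps aux | grep -i train
nvidia-smi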

Could you try to run the training inside the docker container and check if the same thing happens?
$ tao detectnet_v2
# detectnet_v2 train xxx
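
For reference, a filled-in version of that command would reuse the same arguments as the launcher call above. This is only a sketch: the spec and results paths seen inside the container depend on how the host directories are mounted, and the environment variables are assumed to be defined in that shell.

# detectnet_v2 train -e $SPECS_DIR/trafficcamnet_finetune.txt -r $USER_EXPERIMENT_DIR/tcn_d1_finetune3 -k $KEY -n resnet18_detector --gpus $NUM_GPUS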

I guess so, but for now I have converted my Jupyter notebook to an .ipy script and I am running it like this:

ipython home/traffic_cam_net.ipy > /home/traffic_cam_net_output.txt
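
Even as a script, the ipython process is still a child of the SSH shell and gets killed when the session drops. A detached launch is one way around that; a minimal sketch with nohup, keeping the same paths as above:

nohup ipython home/traffic_cam_net.ipy > /home/traffic_cam_net_output.txt 2>&1 &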

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

OK, please monitor whether the training completes.
