Please provide the following information when requesting support.
• Hardware (GTX 1080Ti)
• Network Type (Detectnet_v2)
I work over SSH. When I start the training process in the Jupyter Notebook with
tao detectnet_v2 train -e $SPECS_DIR/trafficcamnet_finetune.txt -r $USER_EXPERIMENT_DIR/tcn_d1_finetune3 -k $KEY -n resnet18_detector --gpus $NUM_GPUS
it runs fine. But after I close my laptop for the night and open it again the next morning, the training has stopped before completing all epochs, and I do not see any early-stopping message either.
I tried to look at some logs:
status.json
{"date": "3/3/2022", "time": "16:16:21", "status": "Starting DetectNet_v2 Training job"}
{"date": "3/3/2022", "time": "16:16:21", "status": "Training gridbox model."}
{"date": "3/3/2022", "time": "16:16:32", "status": "Building DetectNet V2 model"}
{"date": "3/3/2022", "time": "16:16:42", "status": "DetectNet V2 model built."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Building rasterizer."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Rasterizers built."}
{"date": "3/3/2022", "time": "16:16:42", "status": "Building training graph."}
{"date": "3/3/2022", "time": "16:16:43", "status": "Rasterizing tensors."}
{"date": "3/3/2022", "time": "16:16:44", "status": "Tensors rasterized."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Training graph built."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Building validation graph."}
{"date": "3/3/2022", "time": "16:16:46", "status": "Rasterizing tensors."}
{"date": "3/3/2022", "time": "16:16:47", "status": "Tensors rasterized."}
{"date": "3/3/2022", "time": "16:16:47", "status": "Validation graph built."}
{"date": "3/3/2022", "time": "16:16:48", "status": "Running training loop."}
{"loss": 0.06850293278694153, "cur_epoch": 0, "max_epoch": 10, "time_per_epoch": "0:00:00", "ETA": "0:00:00", "learning_rate": 4.999999418942025e-06, "date": "3/3/2022", "time": "16:17:53", "status": "Running."}
{"validation cost": 6.021e-05, "mean average precision": 16.781, "average_precision": {"car": 46.4476, "bus": 24.8801, "long_vehicle": 18.565, "van": 20.3935, "truck": 2.7397, "2wheeler": 2.7919, "pedestrian": 1.6489}, "date": "3/3/2022", "time": "18:30:28", "status": "Evaluation Complete"}
{"loss": 7.775373524054885e-05, "cur_epoch": 1, "max_epoch": 10, "time_per_epoch": "2:13:11.410628", "ETA": "19:58:42.695653", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "18:30:28", "status": "Running."}
{"validation cost": 5.357e-05, "mean average precision": 14.7426, "average_precision": {"car": 58.2489, "bus": 25.7203, "long_vehicle": 18.2387, "van": 0.7273, "truck": 0.206, "2wheeler": 0.0, "pedestrian": 0.057}, "date": "3/3/2022", "time": "20:42:15", "status": "Evaluation Complete"}
{"loss": 3.412932710489258e-05, "cur_epoch": 2, "max_epoch": 10, "time_per_epoch": "2:11:47.082929", "ETA": "17:34:16.663431", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "20:42:15", "status": "Running."}
{"validation cost": 5.206e-05, "mean average precision": 17.4663, "average_precision": {"car": 61.5912, "bus": 33.5529, "long_vehicle": 21.8146, "van": 2.1178, "truck": 1.0009, "2wheeler": 0.5706, "pedestrian": 1.6161}, "date": "3/3/2022", "time": "22:50:0", "status": "Evaluation Complete"}
{"loss": 2.3556069209007546e-05, "cur_epoch": 3, "max_epoch": 10, "time_per_epoch": "2:07:45.075439", "ETA": "14:54:15.528071", "learning_rate": 0.0004999999655410647, "date": "3/3/2022", "time": "22:50:0", "status": "Running."}
{"validation cost": 5.66e-05, "mean average precision": 14.9771, "average_precision": {"car": 57.7176, "bus": 29.3261, "long_vehicle": 16.6422, "van": 0.1008, "truck": 0.3109, "2wheeler": 0.029, "pedestrian": 0.7132}, "date": "3/4/2022", "time": "0:55:2", "status": "Evaluation Complete"}
{"loss": 3.6846518923994154e-05, "cur_epoch": 4, "max_epoch": 10, "time_per_epoch": "2:05:01.840346", "ETA": "12:30:11.042078", "learning_rate": 0.0004999999655410647, "date": "3/4/2022", "time": "0:55:2", "status": "Running."}
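To see quickly where the log stops, status.json can be summarized with a short script. This is only a minimal sketch, assuming the file holds one JSON object per line as in the output above; `summarize_status` is a hypothetical helper name and not part of TAO.

```python
import json


def summarize_status(lines):
    """Return ((cur_epoch, max_epoch), "date time") for the last matching records.

    Assumes one JSON object per line, as in the status.json shown above.
    Blank lines and lines that are not valid JSON are skipped.
    """
    last_epoch = None
    last_stamp = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # Every record in the log carries a date and a time.
        if "date" in rec and "time" in rec:
            last_stamp = f'{rec["date"]} {rec["time"]}'
        # Only the per-epoch "Running." records carry epoch counters.
        if "cur_epoch" in rec and "max_epoch" in rec:
            last_epoch = (rec["cur_epoch"], rec["max_epoch"])
    return last_epoch, last_stamp


# Example usage (the path is an assumption; adjust it to your results directory):
# with open("status.json") as f:
#     print(summarize_status(f))
```

For the log above, the last record is the one with cur_epoch 4 of max_epoch 10, written at 0:55:2 on 3/4/2022, and nothing was logged after that.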
Checking Docker containers:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a98a7ae338e0 nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 "install_ngc_cli.sh …" 17 hours ago Up 17 hours hungry_shirley
Checking active processes:
ps -aux | grep ipy
name 164904 0.0 0.0 11864 648 pts/3 S+ 11:05 0:00 grep --color=auto ipy
ps -aux | grep jupyter
name 164970 0.0 0.0 11868 716 pts/3 S+ 11:05 0:00 grep --color=auto jupyter