Resume model with TAO

• Network Type Yolo_v4

Hello, I have a question about the resume function of TAO. I set the parameter resume_model_path to resume from the epoch-80 model, which finished with:

{"epoch": 80, "max_epoch": 100, "time_per_epoch": "0:28:49.739638", "eta": "9:36:34.792757", "date": "10/30/2022", "time": "10:44:6", "status": "RUNNING", "verbosity": "INFO", "message": "Training loop in progress", "graphical": {"loss": "233.5933", "learning_rate": "0.0002002", "validation_loss": "117.58633340778847", "mean average precision": "0.2888132851081271"}, "kpi": {"mean average precision": "0.2888132851081271"}}

After this, when the model resumed, the first log was:

{"epoch": 81, "max_epoch": 200, "time_per_epoch": "0:28:47.774642", "eta": "2 days, 9:06:45.182397", "date": "10/31/2022", "time": "14:26:37", "status": "RUNNING", "verbosity": "INFO", "message": "Training loop in progress", "graphical": {"loss": "240.46246", "learning_rate": "0.00038922785", "mean average precision": "nan", "validation_loss": "nan"}, "kpi": {"mean average precision": "nan"}}

Why didn't it resume with the same learning rate?
At epoch 80 the values were "loss": "233.5933", "learning_rate": "0.0002002", and after the resume they were
"loss": "240.46246", "learning_rate": "0.00038922785".
It also took another 40 epochs to get close to the results from epoch 80; at epoch 120:
"loss": "237.4374", "learning_rate": "0.0003176395"

Did you resume training without changing the number of epochs?
I saw you first trained with "max_epoch: 100", but it seems you changed it in the following training (you mention epoch 120).

As you can see, in the first run it was max_epoch=100. In the second run, with resume, it was max_epoch=200. But what is the point of your reply?

Please keep the same max_epoch when resuming training.
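Both supported learning-rate schedules in YOLOv4 are computed from overall training progress, i.e. relative to the total number of epochs, so raising max_epoch from 100 to 200 stretches the whole schedule. That is most likely why the learning rate at epoch 81 no longer matched the value at epoch 80. A minimal sketch of what to keep fixed when resuming (assuming the standard yolo_v4 training_config fields; the path is a placeholder):

training_config {
  num_epochs: 100    # same value as the original run, so the LR schedule is unchanged
  resume_model_path: "/workspace/experiments/yolov4/weights/yolov4_epoch_080.tlt"
  # leave learning_rate, optimizer and regularizer exactly as in the original spec
}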

Okay, I understand.
I have another question: how can I implement this paragraph from the YOLOv4 paper (for training with the COCO dataset)?

In MS COCO object detection experiments, the default hyper-parameters are as follows: the training steps is 500,500; the step decay learning rate scheduling strategy is adopted with initial learning rate 0.01 and multiply with a factor 0.1 at the 400,000 steps and the 450,000 steps, respectively; The momentum and weight decay are respectively set as 0.9 and 0.0005. All architectures use a single GPU to execute multi-scale training in the batch size of 64 while mini-batch size is 8 or 4 depend on the architectures and GPU memory limitation.

In TAO I only have a range for the learning rate, but I cannot specify when to drop or increase it, as the paper does ("at step 400k multiply with 0.1").
Also, what is the parameter for weight decay? Does Adam have one?

There is no update from you for a period, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

In TAO YOLOv4 (see YOLOv4 — TAO Toolkit 3.22.05 documentation), only the following learning-rate schedules are supported:

soft_start_annealing_schedule
soft_start_cosine_annealing_schedule

For soft_start_annealing_schedule, you can have a look at DetectNet_v2 — TAO Toolkit 3.22.05 documentation
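For example, the paper's two step-decays at 400k/500.5k (~80% progress) and 450k/500.5k (~90% progress) cannot be reproduced exactly, but soft_start_annealing_schedule lets you choose the progress point at which the learning rate starts to decay towards the minimum. A sketch, assuming the DetectNet_v2-style parameters and purely illustrative values (this block goes inside training_config):

learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 0.01   # the paper's initial learning rate
    soft_start: 0.1           # warm up over the first 10% of training
    annealing: 0.8            # start decaying at ~80% progress, roughly where the paper does its first x0.1 drop
  }
}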

For weight decay, please set the regularizer type to L2. See more in YOLOv4 — TAO Toolkit 3.22.05 documentation
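A sketch of the matching spec snippet, using the paper's weight decay of 0.0005 (as far as I know the Adam optimizer block in the yolo_v4 spec does not expose a weight-decay field, so the L2 regularizer is the way to get an equivalent effect; this block also goes inside training_config):

regularizer {
  type: L2
  weight: 0.0005   # weight decay value from the YOLOv4 paper
}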

For training yolov4 with the COCO dataset, refer to tao_toolkit_recipes/tao_object_dection/yolov4 at main · NVIDIA-AI-IOT/tao_toolkit_recipes · GitHub
