Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
Ubuntu PC x64, RTX 3090
• Network Type
Classification (ResNet-50, 4-class private dataset)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
!tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021
• Training spec file (if you have one, please share it here)
Attached
classification_retrain_spec.cfg (1.1 KB)
classification_spec.cfg (1.2 KB)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
I'm using TAO to train a ResNet-50, 4-class classification model on my private dataset. The image counts for the four classes are roughly 2k, 5k, 10k, and 8k. I noticed that the loss, acc, and val_acc stop improving at around epoch 25, in both train (spec set to 80 total epochs) and re-train (spec set to 120 total epochs). I always let all the epochs finish, and the final model evaluation is not that bad (accuracy and recall are around 0.8 for all 4 classes).
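For reference, the training section of my spec looks roughly like the excerpt below. This is a simplified sketch: only n_epochs and the lr_config values are the real ones I used, the batch size is a placeholder, and the full files are attached above.

train_config {
  batch_size_per_gpu: 64   # placeholder; the real value is in the attached spec
  n_epochs: 80             # 120 in classification_retrain_spec.cfg; could this just be ~25?
  lr_config {
    step {
      learning_rate: 0.009
      step_size: 10
      gamma: 0.1
    }
  }
}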
Train log:
...
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 16/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 17/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 18/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 19/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 20/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 21/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 22/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 23/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 24/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 25/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 26/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 27/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 28/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 29/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 30/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 31/32 : 0
d2dad9ae4ab1:129:179 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
d2dad9ae4ab1:129:179 [0] NCCL INFO Connected all rings
d2dad9ae4ab1:129:179 [0] NCCL INFO Connected all trees
d2dad9ae4ab1:129:179 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d2dad9ae4ab1:129:179 [0] NCCL INFO comm 0x7f589f781730 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
Epoch 2/80
440/440 [==============================] - 101s 229ms/step - loss: 0.9711 - acc: 0.7397 - val_loss: 0.7625 - val_acc: 0.8201
Epoch 3/80
440/440 [==============================] - 101s 229ms/step - loss: 0.9219 - acc: 0.7635 - val_loss: 0.7358 - val_acc: 0.8334
Epoch 4/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8831 - acc: 0.7842 - val_loss: 0.7099 - val_acc: 0.8406
Epoch 5/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8616 - acc: 0.7936 - val_loss: 0.7086 - val_acc: 0.8398
Epoch 6/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8348 - acc: 0.8080 - val_loss: 0.7122 - val_acc: 0.8364
Epoch 7/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8222 - acc: 0.8089 - val_loss: 0.6885 - val_acc: 0.8482
Epoch 8/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8042 - acc: 0.8178 - val_loss: 0.6960 - val_acc: 0.8414
Epoch 9/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8026 - acc: 0.8175 - val_loss: 0.6842 - val_acc: 0.8539
Epoch 10/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7922 - acc: 0.8259 - val_loss: 0.6766 - val_acc: 0.8543
Epoch 11/80
...
...
Epoch 17/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7757 - acc: 0.8299 - val_loss: 0.6767 - val_acc: 0.8543
Epoch 18/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7709 - acc: 0.8330 - val_loss: 0.6837 - val_acc: 0.8493
Epoch 19/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7712 - acc: 0.8352 - val_loss: 0.6792 - val_acc: 0.8528
Epoch 20/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7652 - acc: 0.8375 - val_loss: 0.6737 - val_acc: 0.8592
Epoch 21/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7649 - acc: 0.8328 - val_loss: 0.6777 - val_acc: 0.8539
Epoch 22/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7776 - acc: 0.8310 - val_loss: 0.6758 - val_acc: 0.8562
Epoch 23/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7618 - acc: 0.8366 - val_loss: 0.6795 - val_acc: 0.8535
Epoch 24/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7664 - acc: 0.8353 - val_loss: 0.6807 - val_acc: 0.8528
Epoch 25/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7627 - acc: 0.8374 - val_loss: 0.6808 - val_acc: 0.8501
Epoch 26/80
440/440 [==============================] - 99s 226ms/step - loss: 0.7632 - acc: 0.8367 - val_loss: 0.6739 - val_acc: 0.8539
Epoch 27/80
440/440 [==============================] - 99s 226ms/step - loss: 0.7670 - acc: 0.8356 - val_loss: 0.6758 - val_acc: 0.8554
...
...
Re-train log:
...
...
Epoch 16/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6659 - acc: 0.8804 - val_loss: 0.6893 - val_acc: 0.8543
Epoch 17/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6590 - acc: 0.8802 - val_loss: 0.6912 - val_acc: 0.8512
Epoch 18/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6493 - acc: 0.8868 - val_loss: 0.6799 - val_acc: 0.8505
Epoch 19/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6488 - acc: 0.8858 - val_loss: 0.6859 - val_acc: 0.8569
Epoch 20/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6573 - acc: 0.8833 - val_loss: 0.6825 - val_acc: 0.8573
Epoch 21/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6574 - acc: 0.8808 - val_loss: 0.6866 - val_acc: 0.8577
Epoch 22/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6560 - acc: 0.8828 - val_loss: 0.6830 - val_acc: 0.8535
Epoch 23/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6428 - acc: 0.8866 - val_loss: 0.6842 - val_acc: 0.8550
Epoch 24/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6507 - acc: 0.8867 - val_loss: 0.6879 - val_acc: 0.8562
Epoch 25/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6377 - acc: 0.8916 - val_loss: 0.6812 - val_acc: 0.8558
Epoch 26/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6417 - acc: 0.8893 - val_loss: 0.6840 - val_acc: 0.8569
Epoch 27/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6490 - acc: 0.8869 - val_loss: 0.6830 - val_acc: 0.8573
Epoch 28/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6478 - acc: 0.8875 - val_loss: 0.6870 - val_acc: 0.8569
...
...
Questions:
- Does this mean I can just set the total epochs to about 25 in both train and re-train, rather than 80 and 120, to save time?
- Is there any way to further improve the performance? I once changed the default learning rate from 0.009 to something like 0.015 in the lr_config, which by default is:

# learning_rate
lr_config {
  step {
    learning_rate: 0.009
    step_size: 10
    gamma: 0.1
  }
}

but I did not see much improvement. (An alternative schedule I am considering is sketched after this list.)
- Is it possible to skip the prune and re-train steps and get the final model directly from a single train run? My model runs on a PC with an RTX 3060 GPU, which is obviously not a low-power device. (The pipeline I currently follow is sketched after this list.)
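For the second question, this is the kind of change I have in mind, a sketch only. If I understand the step scheduler correctly (assuming step_size is in epochs), then with step_size: 10 and gamma: 0.1 the learning rate has already decayed by 100x around epoch 20, which might itself explain the plateau, so instead of only raising the initial rate I would try decaying less often:

lr_config {
  step {
    learning_rate: 0.009
    step_size: 25   # decay less often so the LR stays useful past epoch 20 (assumes step_size is in epochs)
    gamma: 0.1
  }
}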
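For the third question, my current pipeline is essentially the following (paths, the checkpoint names such as resnet_080.tlt, and $KEY are placeholders; the flags are as I understand them from the TAO classification docs). I am asking whether I can drop steps 2 and 3 and run step 4 directly on the unpruned model from step 1:

# 1. Train from the pretrained backbone
tao classification train -e classification_spec.cfg -r $RESULTS_DIR/train -k $KEY

# 2. Prune the trained model (the step I would like to skip)
tao classification prune -m $RESULTS_DIR/train/weights/resnet_080.tlt \
    -o $RESULTS_DIR/resnet50_pruned.tlt -eq union -pth 0.6 -k $KEY

# 3. Re-train the pruned model (the other step I would like to skip)
tao classification train -e classification_retrain_spec.cfg -r $RESULTS_DIR/retrain -k $KEY

# 4. Export for deployment
tao classification export -m $RESULTS_DIR/retrain/weights/resnet_080.tlt \
    -o $RESULTS_DIR/final_model.etlt -k $KEY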