Loss, acc, and val_acc stabilize early in both train and re-train

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Ubuntu PC x64, RTX 3090
• Network Type
(resnet50, 4 class private dataset, Classification)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

!tao info

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file(If have, please share here)
Attached
classification_retrain_spec.cfg (1.1 KB)
classification_spec.cfg (1.2 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I’m using TAO to train a ResNet-50, 4-class classification model on my private dataset. The image counts for the four classes are 2k, 5k, 10k, and 8k.

I noticed that loss, acc, and val_acc stop improving at around epoch 25 in both train (80 total epochs in the spec) and re-train (120 total epochs in the spec). I always wait for all epochs to finish, and the final model evaluation is not that bad (accuracy and recall are around 0.8 for all 4 classes).

Train log:

...

d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 16/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 17/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 18/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 19/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 20/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 21/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 22/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 23/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 24/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 25/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 26/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 27/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 28/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 29/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 30/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Channel 31/32 :    0
d2dad9ae4ab1:129:179 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
d2dad9ae4ab1:129:179 [0] NCCL INFO Connected all rings
d2dad9ae4ab1:129:179 [0] NCCL INFO Connected all trees
d2dad9ae4ab1:129:179 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d2dad9ae4ab1:129:179 [0] NCCL INFO comm 0x7f589f781730 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
Epoch 2/80
440/440 [==============================] - 101s 229ms/step - loss: 0.9711 - acc: 0.7397 - val_loss: 0.7625 - val_acc: 0.8201
Epoch 3/80
440/440 [==============================] - 101s 229ms/step - loss: 0.9219 - acc: 0.7635 - val_loss: 0.7358 - val_acc: 0.8334
Epoch 4/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8831 - acc: 0.7842 - val_loss: 0.7099 - val_acc: 0.8406
Epoch 5/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8616 - acc: 0.7936 - val_loss: 0.7086 - val_acc: 0.8398
Epoch 6/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8348 - acc: 0.8080 - val_loss: 0.7122 - val_acc: 0.8364
Epoch 7/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8222 - acc: 0.8089 - val_loss: 0.6885 - val_acc: 0.8482
Epoch 8/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8042 - acc: 0.8178 - val_loss: 0.6960 - val_acc: 0.8414
Epoch 9/80
440/440 [==============================] - 100s 227ms/step - loss: 0.8026 - acc: 0.8175 - val_loss: 0.6842 - val_acc: 0.8539
Epoch 10/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7922 - acc: 0.8259 - val_loss: 0.6766 - val_acc: 0.8543
Epoch 11/80
...
...
Epoch 17/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7757 - acc: 0.8299 - val_loss: 0.6767 - val_acc: 0.8543
Epoch 18/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7709 - acc: 0.8330 - val_loss: 0.6837 - val_acc: 0.8493
Epoch 19/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7712 - acc: 0.8352 - val_loss: 0.6792 - val_acc: 0.8528
Epoch 20/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7652 - acc: 0.8375 - val_loss: 0.6737 - val_acc: 0.8592
Epoch 21/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7649 - acc: 0.8328 - val_loss: 0.6777 - val_acc: 0.8539
Epoch 22/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7776 - acc: 0.8310 - val_loss: 0.6758 - val_acc: 0.8562
Epoch 23/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7618 - acc: 0.8366 - val_loss: 0.6795 - val_acc: 0.8535
Epoch 24/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7664 - acc: 0.8353 - val_loss: 0.6807 - val_acc: 0.8528
Epoch 25/80
440/440 [==============================] - 100s 227ms/step - loss: 0.7627 - acc: 0.8374 - val_loss: 0.6808 - val_acc: 0.8501
Epoch 26/80
440/440 [==============================] - 99s 226ms/step - loss: 0.7632 - acc: 0.8367 - val_loss: 0.6739 - val_acc: 0.8539
Epoch 27/80
440/440 [==============================] - 99s 226ms/step - loss: 0.7670 - acc: 0.8356 - val_loss: 0.6758 - val_acc: 0.8554
...
...

Retrain log:

...
...
Epoch 16/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6659 - acc: 0.8804 - val_loss: 0.6893 - val_acc: 0.8543
Epoch 17/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6590 - acc: 0.8802 - val_loss: 0.6912 - val_acc: 0.8512
Epoch 18/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6493 - acc: 0.8868 - val_loss: 0.6799 - val_acc: 0.8505
Epoch 19/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6488 - acc: 0.8858 - val_loss: 0.6859 - val_acc: 0.8569
Epoch 20/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6573 - acc: 0.8833 - val_loss: 0.6825 - val_acc: 0.8573
Epoch 21/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6574 - acc: 0.8808 - val_loss: 0.6866 - val_acc: 0.8577
Epoch 22/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6560 - acc: 0.8828 - val_loss: 0.6830 - val_acc: 0.8535
Epoch 23/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6428 - acc: 0.8866 - val_loss: 0.6842 - val_acc: 0.8550
Epoch 24/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6507 - acc: 0.8867 - val_loss: 0.6879 - val_acc: 0.8562
Epoch 25/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6377 - acc: 0.8916 - val_loss: 0.6812 - val_acc: 0.8558
Epoch 26/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6417 - acc: 0.8893 - val_loss: 0.6840 - val_acc: 0.8569
Epoch 27/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6490 - acc: 0.8869 - val_loss: 0.6830 - val_acc: 0.8573
Epoch 28/120
440/440 [==============================] - 95s 216ms/step - loss: 0.6478 - acc: 0.8875 - val_loss: 0.6870 - val_acc: 0.8569
...
...

Questions:

  1. Does this mean I can just set the total epochs to 25 for both train and re-train, rather than 80 and 120, to save time?

  2. Is there any way to further improve the performance?
    I once adjusted the default learning rate from 0.009 to around 0.015:

    # learning_rate
      lr_config {
        step {
          learning_rate: 0.009
          step_size: 10
          gamma: 0.1
        }
      }
    

    but I did not see much improvement.

  3. Is it possible to skip the prune and re-train steps and get the model directly from a single training run? My model runs on a PC with an RTX 3060 GPU, which is obviously not a low-power device.

First, I suggest using the latest TAO.
For item 1, yes, you can set a lower number of total epochs.
For item 2, again it is suggested to use the latest TAO. In addition, if possible, add more training images. If that is not possible, copy the existing images for each class and train on the oversampled dataset. You can also tune the batch size (see the sketch below).
Finally, if you have previously trained a model on the ImageNet dataset, please use it as the pretrained model.
For item 3, yes, it is possible.
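
For reference, the knobs mentioned above live in the training spec. A minimal sketch, assuming the standard TAO classification train_config fields; the paths and values below are placeholders, not taken from your attached spec files:

    train_config {
      train_dataset_path: "/workspace/tao-experiments/data/split/train"
      val_dataset_path: "/workspace/tao-experiments/data/split/val"
      # Reuse previously trained ImageNet weights (or the NGC pretrained backbone) here.
      pretrained_model_path: "/workspace/tao-experiments/classification/pretrained_resnet50/resnet_50.hdf5"
      batch_size_per_gpu: 64   # value to tune against GPU memory and accuracy
      n_epochs: 40             # lower than 80/120, since your curves flatten around epoch 25
      # (optimizer, reg_config, lr_config, etc. as in your attached spec)
    }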

  1. My private dataset has 2k, 5k, 10k, and 8k images for the 4 classes. Is that count still too small? If so, what numbers would you suggest?
  2. By “copy existing images for each class”, do you mean copying the images and pasting them into the same folder (with different file names, of course)? Is the purpose to form a balanced dataset across the classes? If so, for the classes with fewer images I would need to repeat the copy, e.g. for the 2k class I would copy it 4 times to reach 10k, correct?

Yes, the aim is to generate a more balanced dataset.
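
If it helps, here is a minimal sketch of that copy-based oversampling, assuming the usual classification layout of one subfolder per class under the train split; the path, target count, and .jpg extension are placeholders for your own setup:

    import shutil
    from pathlib import Path

    # Hypothetical layout: one subfolder per class, e.g. train/<class_name>/*.jpg
    train_dir = Path("/workspace/tao-experiments/data/split/train")
    target_count = 10000  # roughly the size of the largest class

    for class_dir in sorted(train_dir.iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpg"))
        n = len(images)
        copy_round = 1
        # Duplicate the existing images (under new names) until the class reaches
        # the target count, e.g. the 2k class gets copied 4 more times to hit 10k.
        while images and n < target_count:
            for img in images:
                if n >= target_count:
                    break
                shutil.copy(img, class_dir / f"{img.stem}_dup{copy_round}{img.suffix}")
                n += 1
            copy_round += 1
        print(f"{class_dir.name}: {n} images after oversampling")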

Additionally, please try a lower learning rate.
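
For example, in the spec (0.004 is only an illustrative starting value, not a recommendation from the TAO docs):

    lr_config {
      step {
        learning_rate: 0.004
        step_size: 10
        gamma: 0.1
      }
    }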

Regarding “skip the prune and re-train steps, and get the model directly from a single training run (based on a public pretrained model)”, since my model runs on a PC with an RTX 3060 GPU, which is obviously not a low-power device:

Do you have any steps for this?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

It is possible to get and use the model directly when training finishes. Pruning and retraining are not required.
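
A rough sketch of that single-train flow with the TAO 3.x launcher; the result paths, checkpoint name, and $KEY are placeholders for whatever your notebook uses:

    # Train once, with classification_spec.cfg pointing pretrained_model_path
    # at the public pretrained backbone.
    !tao classification train -e classification_spec.cfg \
         -r /workspace/tao-experiments/classification/output \
         -k $KEY

    # Skip prune/retrain: export the trained checkpoint directly for deployment.
    !tao classification export \
         -m /workspace/tao-experiments/classification/output/weights/resnet_080.tlt \
         -k $KEY \
         -o /workspace/tao-experiments/classification/export/final_model.etlt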

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.