Retraining Gesturenet

taoonvision · May 19, 2022, 9:00am

Please provide the following information when requesting support.

• Hardware (Ubuntu 18.04 PC with NVIDIA Quadro K610M)
• Network Type (GestureNet)

Hello,
I am trying to train the GestureNet model with an added class, the Y_P gesture from the HGR dataset, but I can’t get the model to train well (the cross-entropy loss stays around 10). I am doing everything inside the Jupyter notebook for GestureNet.

• How to reproduce the issue ?
What I did, step by step:

I set “Y_P”: “rock” in convert_hgr_to_tlt_data.py so that my dataset includes the new class rock, then ran convert_hgr_to_tlt_data.py to create the TLT-format dataset.
In specs/dataset_experiment_config.json I changed class_weights to:

    "class_weights": {
        "random": 0.16,
        "thumbs_up": 0.16,
        "two": 0.16,
        "stop": 0.10,
        "ok": 0.10,
        "fist": 0.16,
        "rock": 0.16
    }

I then ran command !tao gesturenet dataset_convert
In specs/train_spec.json I changed num_classes, add_new_head and classes:

    "add_new_head": true,
    "classes": {
        "thumbs_up": 0,
        "fist": 1,
        "stop": 2,
        "ok": 3,
        "two": 4,
        "random": 5,
        "rock": 6
    },
    "num_classes": 7
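
A quick way to rule out a mismatch between the two edited files is to compare the class names, indices, and count programmatically. The following is a minimal sketch, not part of the TAO tooling: the function name `check_class_config` is hypothetical, and the values are copied by hand from the snippets above rather than loaded from the spec files.

```python
def check_class_config(classes, num_classes, class_weights):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if len(classes) != num_classes:
        problems.append(
            f"num_classes is {num_classes} but {len(classes)} classes are listed"
        )
    # Class indices should cover 0 .. n-1 exactly once.
    if sorted(classes.values()) != list(range(len(classes))):
        problems.append("class indices are not a contiguous range starting at 0")
    missing = set(classes) - set(class_weights)
    extra = set(class_weights) - set(classes)
    if missing:
        problems.append(f"classes without a weight: {sorted(missing)}")
    if extra:
        problems.append(f"weights for unknown classes: {sorted(extra)}")
    return problems


# Values copied from the snippets posted above.
classes = {"thumbs_up": 0, "fist": 1, "stop": 2, "ok": 3,
           "two": 4, "random": 5, "rock": 6}
class_weights = {"random": 0.16, "thumbs_up": 0.16, "two": 0.16, "stop": 0.10,
                 "ok": 0.10, "fist": 0.16, "rock": 0.16}

print(check_class_config(classes, 7, class_weights))  # prints []
```

For the values posted here the check comes back empty, which suggests the class list itself is not the source of the problem.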

I then ran !tao gesturenet train -e $SPECS_DIR/train_spec.json \ -k $KEY . The problem here is that the loss stays around 10 and doesn’t get lower. Because of this loss, when evaluating with this trained model, I get these metrics which are really bad:

    pred_label   fist  random    two
    gt_label
    fist         22.2    77.8    0.0
    ok           8.33    91.7    0.0
    random       34.1    65.9    0.0
    rock          0.0   1e+02    0.0
    stop        1e+02     0.0    0.0
    thumbs_up    60.0     0.0   40.0
    two           0.0   1e+02    0.0
    All          34.1    63.6   2.27

I also tried with the unmodified dataset and it works well. However, when I set add_new_head in the train_spec file to True, the model gets worse. Could someone enlighten me on what add_new_head really does?

Feel free to ask if you need more information. Thanks in advance.
Nicolas

taoonvision · May 19, 2022, 9:18am

Output of !tao gesturenet train -e $SPECS_DIR/train_spec.json -k $KEY:

Epoch 1/50
1/434 […] - ETA: 2:58 - loss: 16.2064 - categorical_accuracy: 0.0000e+00/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.366803). Check your callbacks.
% delta_t_median)
2/434 […] - ETA: 2:57 - loss: 16.2064 - categorical_accuracy: 0.0000e+00/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.183856). Check your callbacks.
% delta_t_median)
434/434 [==============================] - 27s 61ms/step - loss: 9.9892 - categorical_accuracy: 0.1659 - val_loss: 10.5450 - val_categorical_accuracy: 0.1643
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-05-19 08:31:13,069 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

Epoch 2/50
434/434 [==============================] - 26s 59ms/step - loss: 10.4679 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 3/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 4/50
434/434 [==============================] - 24s 55ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 5/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 6/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 7/50
434/434 [==============================] - 24s 55ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 8/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 9/50
434/434 [==============================] - 25s 59ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 10/50
434/434 [==============================] - 25s 57ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 11/50
434/434 [==============================] - 23s 53ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 12/50
434/434 [==============================] - 25s 58ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
2022-05-19 08:35:47,521 [INFO] driveix.classifynet.trainer.classifynet_trainer: Finished training full model
2022-05-19 08:35:55,156 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val Loss: 7.027486324310303
2022-05-19 08:35:55,156 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val accuracy: 0.16428571939468384
2022-05-19 08:35:55,156 [INFO] main: Training finished successfully.
2022-05-19 10:35:56,505 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
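
To make the plateau in a log like this easy to spot, the end-of-epoch lines can be parsed and checked for a stalled loss. This is a minimal sketch that assumes the Keras progress-bar format shown above; the helper names `epoch_losses` and `is_stalled`, the window size, and the tolerance are illustrative choices, not anything provided by TAO.

```python
import re

# Matches the end-of-epoch summary line printed by the Keras progress bar.
EPOCH_LINE = re.compile(r" - loss: ([0-9.]+) .* val_loss: ([0-9.]+)")


def epoch_losses(log_text):
    """Extract (loss, val_loss) pairs from the end-of-epoch summary lines."""
    pairs = []
    for line in log_text.splitlines():
        if "val_loss" not in line:
            continue  # skip per-batch progress lines and warnings
        match = EPOCH_LINE.search(line)
        if match:
            pairs.append((float(match.group(1)), float(match.group(2))))
    return pairs


def is_stalled(values, window=3, tol=1e-4):
    """True if the last `window` values differ by less than `tol`."""
    if len(values) < window:
        return False
    tail = values[-window:]
    return max(tail) - min(tail) < tol


# Three summary lines copied from the log above (epochs 2 to 4).
log = """\
434/434 [==============================] - 26s 59ms/step - loss: 10.4679 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
434/434 [==============================] - 24s 55ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
"""

val_losses = [val for _, val in epoch_losses(log)]
print(is_stalled(val_losses))  # prints True
```

A validation loss that is identical to four decimal places across epochs, as here, usually means the weights that matter are no longer changing, rather than that training is merely slow.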

Morganh · May 20, 2022, 3:53am

Only set add_new_head: false if you want to finetune on a dataset with the same gestures as the pretrained model.

add_new_head (bool) is a flag that controls whether a new head is added. If set to True, it removes the last layer, loads the pretrained weights, and then adds a new head.
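
The general idea of swapping the head can be illustrated with a toy model. This is only a conceptual sketch in plain Python, not how TAO implements add_new_head: the model is represented as a hypothetical list of (name, weights) pairs, and `replace_head` is an invented helper.

```python
import random


def replace_head(layers, num_features, num_classes, seed=0):
    """Drop the final layer and append a freshly initialized one.

    `layers` is a list of (name, weights) pairs standing in for a model.
    Every layer except the last keeps its pretrained weights.
    """
    rng = random.Random(seed)
    backbone = layers[:-1]
    # The new head is a num_features x num_classes matrix of small random values.
    new_head = [
        [rng.gauss(0.0, 0.01) for _ in range(num_classes)]
        for _ in range(num_features)
    ]
    return backbone + [("new_head", new_head)]


# Toy "pretrained" model: one backbone layer and a 6-class head.
pretrained = [
    ("backbone", [[0.5, -0.2], [0.1, 0.3]]),
    ("old_head", [[0.0] * 6 for _ in range(2)]),
]
model = replace_head(pretrained, num_features=2, num_classes=7)
print([name for name, _ in model])  # prints ['backbone', 'new_head']
```

Whatever the actual implementation looks like, a head added this way starts from untrained values, so it has to be trained before its predictions mean anything.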

From the log, the loss seems to stay the same throughout training.
Did you see this behavior when you “tried with the dataset without changes and it works well”?

taoonvision · May 20, 2022, 6:14am

Thank you Morgan for the explanation about add_new_head, that’s what I thought.
No, during training on the normal dataset the loss was going down epoch after epoch as expected, I really don’t know where this problem could come from.
Tell me if I can send you any other information to solve this problem.

Best regards,
Nicolas

Morganh · May 20, 2022, 6:31am

To narrow down, could you remove the pretrained model in the spec file and try to run again?

taoonvision · May 23, 2022, 7:19am

I don’t understand what you mean by “removing the pretrained model in the spec file”. Do you mean removing the weights from the train_spec file?

What I did was train the base model with the base weights (gesturenet_vtrainable_v1.0/model.tlt) on the base dataset.
The only parameter I changed from the default config is add_new_head: True.

The training went well, here are the last epochs of the training:

Epoch 40/50
453/453 [==============================] - 29s 63ms/step - loss: 2.2338 - categorical_accuracy: 0.1634 - val_loss: 1.9925 - val_categorical_accuracy: 0.1951
Epoch 41/50
453/453 [==============================] - 28s 62ms/step - loss: 2.2316 - categorical_accuracy: 0.1634 - val_loss: 1.9915 - val_categorical_accuracy: 0.1951
Epoch 42/50
453/453 [==============================] - 28s 61ms/step - loss: 2.2297 - categorical_accuracy: 0.1634 - val_loss: 1.9905 - val_categorical_accuracy: 0.1951
Epoch 43/50
453/453 [==============================] - 28s 63ms/step - loss: 2.2287 - categorical_accuracy: 0.1634 - val_loss: 1.9895 - val_categorical_accuracy: 0.1951
Epoch 44/50
453/453 [==============================] - 28s 62ms/step - loss: 2.2287 - categorical_accuracy: 0.1634 - val_loss: 1.9886 - val_categorical_accuracy: 0.1951
Epoch 45/50
453/453 [==============================] - 29s 63ms/step - loss: 2.2260 - categorical_accuracy: 0.1634 - val_loss: 1.9878 - val_categorical_accuracy: 0.1951
Epoch 46/50
453/453 [==============================] - 28s 62ms/step - loss: 2.2228 - categorical_accuracy: 0.1634 - val_loss: 1.9870 - val_categorical_accuracy: 0.1951
Epoch 47/50
453/453 [==============================] - 28s 61ms/step - loss: 2.2230 - categorical_accuracy: 0.1634 - val_loss: 1.9862 - val_categorical_accuracy: 0.1951
Epoch 48/50
453/453 [==============================] - 28s 61ms/step - loss: 2.2217 - categorical_accuracy: 0.1634 - val_loss: 1.9855 - val_categorical_accuracy: 0.1951
Epoch 49/50
453/453 [==============================] - 27s 61ms/step - loss: 2.2213 - categorical_accuracy: 0.1634 - val_loss: 1.9848 - val_categorical_accuracy: 0.1951
Epoch 50/50
453/453 [==============================] - 29s 63ms/step - loss: 2.2208 - categorical_accuracy: 0.1634 - val_loss: 1.9841 - val_categorical_accuracy: 0.1951
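
A useful reference point when reading these numbers is the cross-entropy of a classifier that assigns the same probability to every class, which is ln(n) for n classes. The sketch below computes it for 6 and 7 classes; the function name is illustrative. Whether the logged loss also includes the L2 regularization term is an assumption, not something the thread confirms.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a prediction that assigns 1/num_classes to every class."""
    return -math.log(1.0 / num_classes)


for n in (6, 7):
    print(n, round(uniform_cross_entropy(n), 4))
# prints:
# 6 1.7918
# 7 1.9459
```

A loss that hovers near or above this value, together with an accuracy that never changes, is what one would expect from a model that has not learned to separate the classes.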

However, a problem occurs during evaluation: the model only predicts the class ‘stop’ for all images:

2022-05-23 06:57:26,891 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: —Confusion Matrix—
2022-05-23 06:57:26,891 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: pred_label stop
gt_label
fist 1e+02
ok 1e+02
random 1e+02
stop 1e+02
thumbs_up 1e+02
two 1e+02
All 1e+02

2022-05-23 06:57:27,135 [INFO] driveix.classifynet.evaluator.classifynet_evaluator:
             precision    recall  f1-score   support
        fist    0.0000    0.0000    0.0000         9
          ok    0.0000    0.0000    0.0000        12
      random    0.0000    0.0000    0.0000        49
        stop    0.1023    1.0000    0.1856         9
   thumbs_up    0.0000    0.0000    0.0000         5
         two    0.0000    0.0000    0.0000         4
    accuracy                        0.1023        88
   macro avg    0.0170    0.1667    0.0309        88
weighted avg    0.0105    0.1023    0.0190        88

The model doesn’t have this problem when add_new_head is set to False. Do you have any clue why?
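
The report above is exactly what a constant prediction produces, which can be checked from the support counts alone. This is a small sketch with a hypothetical helper, `constant_prediction_report`; the support values are copied from the report.

```python
# Per-class sample counts copied from the "support" column above.
support = {"fist": 9, "ok": 12, "random": 49, "stop": 9, "thumbs_up": 5, "two": 4}


def constant_prediction_report(support, predicted):
    """Precision, recall and F1 per class when every sample gets `predicted`."""
    total = sum(support.values())
    report = {}
    for label, count in support.items():
        if label == predicted:
            precision = count / total  # only `count` of all predictions are right
            recall = 1.0               # every sample of this class is "found"
            f1 = 2 * precision * recall / (precision + recall)
        else:
            precision = recall = f1 = 0.0
        report[label] = (precision, recall, f1)
    return report


report = constant_prediction_report(support, "stop")
print([round(value, 4) for value in report["stop"]])  # prints [0.1023, 1.0, 0.1856]
```

The precision of 9/88 and the F1 of 0.1856 match the logged values, so the metrics carry no information beyond the fact that every image was labelled ‘stop’.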

Best regards,
Nicolas

Morganh · May 23, 2022, 7:37am

Yes.

Morganh · May 23, 2022, 7:39am

The last training log is running without pretrained weights, right?

taoonvision · May 23, 2022, 7:55am

No, the previous logs were from a training with the pretrained weights (gesturenet_vtrainable_v1.0/model.tlt).

Here are the logs from a training run without any weights specified in the train_spec file:

2022-05-23 07:40:01,042 [INFO] driveix.classifynet.models.resnet_vanilla: Model loaded with random weight initilization.

Epoch 1/50
2/453 […] - ETA: 7:34 - loss: 7.4780 - categorical_accuracy: 0.0000e+00/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.420138). Check your callbacks.
% delta_t_median)
453/453 [==============================] - 63s 139ms/step - loss: 6.9864 - categorical_accuracy: 0.1810 - val_loss: 6.9593 - val_categorical_accuracy: 0.1382
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-05-23 07:41:07,498 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

Epoch 2/50
453/453 [==============================] - 59s 131ms/step - loss: 6.9783 - categorical_accuracy: 0.1656 - val_loss: 6.9590 - val_categorical_accuracy: 0.1382
Epoch 3/50
453/453 [==============================] - 61s 135ms/step - loss: 6.9825 - categorical_accuracy: 0.1369 - val_loss: 6.9590 - val_categorical_accuracy: 0.1545
Epoch 4/50
453/453 [==============================] - 61s 134ms/step - loss: 6.9759 - categorical_accuracy: 0.1567 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
Epoch 5/50
453/453 [==============================] - 59s 131ms/step - loss: 6.9789 - categorical_accuracy: 0.1391 - val_loss: 6.9590 - val_categorical_accuracy: 0.1545
Epoch 6/50
453/453 [==============================] - 59s 130ms/step - loss: 6.9677 - categorical_accuracy: 0.1788 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
Epoch 7/50
453/453 [==============================] - 59s 130ms/step - loss: 6.9690 - categorical_accuracy: 0.1876 - val_loss: 6.9590 - val_categorical_accuracy: 0.1463
Epoch 8/50
453/453 [==============================] - 56s 125ms/step - loss: 6.9724 - categorical_accuracy: 0.1479 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
Epoch 9/50
453/453 [==============================] - 60s 134ms/step - loss: 6.9694 - categorical_accuracy: 0.1678 - val_loss: 6.9590 - val_categorical_accuracy: 0.1545
Epoch 10/50
453/453 [==============================] - 60s 133ms/step - loss: 6.9733 - categorical_accuracy: 0.1545 - val_loss: 6.9588 - val_categorical_accuracy: 0.1626
Epoch 11/50
453/453 [==============================] - 60s 132ms/step - loss: 6.9667 - categorical_accuracy: 0.1832 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
2022-05-23 07:51:19,588 [INFO] driveix.classifynet.trainer.classifynet_trainer: Finished training full model
2022-05-23 07:51:33,892 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val Loss: 6.958917617797852
2022-05-23 07:51:33,892 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val accuracy: 0.15447154641151428
2022-05-23 07:51:33,893 [INFO] main: Training finished successfully.
2022-05-23 09:51:36,301 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I still have the same problem:

2022-05-23 07:55:05,355 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: Calculating per-class P/R and confusion matrix.
2022-05-23 07:55:13,053 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: —Confusion Matrix—
2022-05-23 07:55:13,053 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: pred_label thumbs_up
gt_label
fist 1e+02
ok 1e+02
random 1e+02
stop 1e+02
thumbs_up 1e+02
two 1e+02
All 1e+02

Thanks for the quick answer.

Morganh · May 23, 2022, 8:06am

Please share the full training spec with us when you train 7 classes.
I will try to reproduce on my side.

taoonvision · May 23, 2022, 8:37am

My train_spec.json file:

{
    "random_seed": 108,
    "batch_size": 1,
    "output_experiments_fld": "/workspace/tao-experiments/gesturenet/",
    "save_weights_path": "model",
    "trainer": {
        "class": "ClassifyNetTrainer",
        "module": "driveix.classifynet.trainer.classifynet_trainer",
        "top_training": {
            "stage_order": 1,
            "loss_fn": "categorical_crossentropy",
            "train_epochs": 0,
            "num_layers_unfreeze": 0,
            "optimizer": "rmsprop"
        },
        "finetuning": {
            "stage_order": 2,
            "train_epochs": 50,
            "loss_fn": "categorical_crossentropy",
            "initial_lrate": 5e-05,
            "decay_step_size": 33,
            "lr_drop_rate": 0.5,
            "enable_checkpointing": true,
            "num_layers_unfreeze": 3,
            "optimizer": "sgd"
        },
        "num_workers": 1
    },
    "model": {
        "image_height": 160,
        "image_width": 160,
        "gray_scale_input": false,
        "data_format": "channels_last",
        "base_model": "resnet_vanilla",
        "num_layers": 18,
        "use_batch_norm": true,
        "weights_init": "/workspace/tao-experiments/gesturenet/pretrained_models/gesturenet_vtrainable_v1.0/model.tlt",
        "add_new_head": true,
        "kernel_regularizer_type": "l2",
        "kernel_regularization_factor": 0.001
    },
    "dataset": {
        "image_root_path": "/workspace/tao-experiments/gesturenet/",
        "classes": {
            "thumbs_up": 0,
            "fist": 1,
            "stop": 2,
            "ok": 3,
            "two": 4,
            "random": 5,
            "rock": 6
        },
        "data_path": "/workspace/tao-experiments/gesturenet/data.json",
        "num_classes": 7,
        "augmentation": {
            "shear_range": 0.0,
            "color_pca_aug": {
                "enable": false,
                "probability": 0.5
            },
            "gamma_aug": {
                "enable": true,
                "probability": 0.5,
                "lower_limit": 0.5,
                "upper_limit": 2.0
            },
            "rotation_range": 5,
            "brightness_range": [
                0.5,
                1.5
            ],
            "occlusion_aug": {
                "max_aspect_ratio": 3.33,
                "max_area": 0.25,
                "enable": true,
                "probability": 0.5,
                "pixel_level": true,
                "min_area": 0.05,
                "min_pixel": 0,
                "max_pixel": 255,
                "min_aspect_ratio": 0.3
            },
            "horizontal_flip": true
        }
    },
    "evaluator": {
        "evaluation_exp_name": "results",
        "data_path": "/workspace/tao-experiments/gesturenet/data.json"
    }
}
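
The three learning-rate fields in the finetuning block read like a step schedule. The sketch below assumes, without confirmation from the thread or from the TAO implementation, that the rate is multiplied by lr_drop_rate once every decay_step_size epochs; the function `step_decay` is a hypothetical illustration of that assumption only.

```python
def step_decay(initial_lrate, lr_drop_rate, decay_step_size, epoch):
    """Learning rate at a 0-based epoch under an ASSUMED simple step schedule."""
    return initial_lrate * lr_drop_rate ** (epoch // decay_step_size)


# Values from the finetuning block: initial_lrate, lr_drop_rate, decay_step_size.
for epoch in (0, 32, 33, 49):
    print(epoch, step_decay(5e-05, 0.5, 33, epoch))
```

If that reading is right, a 50-epoch run would see a single halving of the rate at epoch 33, so the schedule by itself would not explain a loss that is flat from the second epoch onward.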

If you want to use the same dataset as I did (with the new class rock) please refer to the first message I posted describing how I did it.
Thanks Morgan.

taoonvision · May 24, 2022, 7:39am

Hello Morgan,
Did you manage to reproduce the problem on your computer?
Best regards,
Nicolas

Morganh · May 24, 2022, 7:43am

Sorry, I have not tried yet. Will run it later.

taoonvision · May 24, 2022, 7:45am

No problem, thank you for your responsiveness. Keep me informed.

Morganh · May 27, 2022, 9:45am

With your steps, I can reproduce your result. Please modify your spec along the lines of the one below.

{
    "random_seed": 108,
    "batch_size": 2,
    "output_experiments_fld": "/workspace/demo_3.0/forum_repro/gesturenet/",
    "save_weights_path": "model",
    "trainer": {
        "class": "ClassifyNetTrainer",
        "module": "driveix.classifynet.trainer.classifynet_trainer",
        "top_training": {
            "stage_order": 1,
            "loss_fn": "categorical_crossentropy",
            "train_epochs": 1,
            "num_layers_unfreeze": 11,
            "optimizer": "rmsprop"
        },
        "finetuning": {
            "stage_order": 2,
            "train_epochs": 500,
            "loss_fn": "categorical_crossentropy",
            "initial_lrate": 5e-05,
            "decay_step_size": 33,
            "lr_drop_rate": 0.5,
            "enable_checkpointing": true,
            "num_layers_unfreeze": 100,
            "optimizer": "sgd"
        },
        "num_workers": 1
    },
    "model": {
        "image_height": 160,
        "image_width": 160,
        "gray_scale_input": false,
        "data_format": "channels_first",
        "base_model": "resnet_vanilla",
        "num_layers": 34,
        "use_batch_norm": false,
        "weights_init": "",
        "add_new_head": false,
 
        "kernel_regularizer_type": "l2",
        "kernel_regularization_factor": 0.001
    },
    "dataset": {
        "image_root_path": "/workspace/demo_3.0/forum_repro/gesturenet",
        "classes": {
            "thumbs_up": 0,
            "fist": 1,
            "stop": 2,
            "ok": 3,
            "two": 4,
            "random": 5,
            "rock": 6
        },
        "data_path": "/workspace/demo_3.0/forum_repro/gesturenet/data.json",
        "num_classes": 7,
        "augmentation": {
            "shear_range": 0.0,
            "color_pca_aug": {
                "enable": false,
                "probability": 0.5
            },
            "gamma_aug": {
                "enable": true,
                "probability": 0.5,
                "lower_limit": 0.5,
                "upper_limit": 2.0
            },
            "rotation_range": 5,
            "brightness_range": [
                0.5,
                1.5
            ],
            "occlusion_aug": {
                "max_aspect_ratio": 3.33,
                "max_area": 0.25,
                "enable": true,
                "probability": 0.5,
                "pixel_level": true,
                "min_area": 0.05,
                "min_pixel": 0,
                "max_pixel": 255,
                "min_aspect_ratio": 0.3
            },
            "horizontal_flip": true
        }
    },
    "evaluator": {
        "evaluation_exp_name": "results",
        "data_path": "/workspace/demo_3.0/forum_repro/gesturenet/data.json"
    }
}

Result: The val accuracy will reach about 0.7284.

loss: 9.7489 - categorical_accuracy: 0.6146 - val_loss: 9.5982 - val_categorical_accuracy: 0.7284
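
Since the two specs are long, it can help to list the fields that differ. The following is a minimal sketch of a recursive comparison; `diff_specs` is a hypothetical helper, and the two dictionaries are hand-copied excerpts of the specs posted in this thread (paths and unchanged fields omitted), not the full files. In practice both files could be read with `json.load`.

```python
def diff_specs(old, new, prefix=""):
    """Yield (dotted_key, old_value, new_value) for every differing leaf."""
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            yield from diff_specs(a, b, prefix=path + ".")
        elif a != b:
            yield path, a, b


# Excerpt of the spec posted on May 23.
original = {
    "batch_size": 1,
    "trainer": {
        "top_training": {"train_epochs": 0, "num_layers_unfreeze": 0},
        "finetuning": {"train_epochs": 50, "num_layers_unfreeze": 3},
    },
    "model": {
        "data_format": "channels_last",
        "num_layers": 18,
        "use_batch_norm": True,
        "weights_init": "/workspace/tao-experiments/gesturenet/pretrained_models/gesturenet_vtrainable_v1.0/model.tlt",
        "add_new_head": True,
    },
}

# Excerpt of the spec posted on May 27.
suggested = {
    "batch_size": 2,
    "trainer": {
        "top_training": {"train_epochs": 1, "num_layers_unfreeze": 11},
        "finetuning": {"train_epochs": 500, "num_layers_unfreeze": 100},
    },
    "model": {
        "data_format": "channels_first",
        "num_layers": 34,
        "use_batch_norm": False,
        "weights_init": "",
        "add_new_head": False,
    },
}

for path, before, after in diff_specs(original, suggested):
    print(f"{path}: {before!r} -> {after!r}")
```

The comparison makes it clear that the suggested spec changes several things at once (depth, batch norm, data format, unfrozen layers, epochs, and the absence of pretrained weights), so the reported accuracy cannot be attributed to any single setting.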

taoonvision · May 30, 2022, 7:23am

Thanks for the help, but I still have the problem of the loss staying the same. I will try it as soon as possible on another machine with more compute capability.
However, I still have a question: why doesn’t loading pretrained weights and then adding a new head work properly? In the spec file you provided, you set add_new_head: false and don’t provide any weights.

Morganh · May 30, 2022, 3:45pm

I have not tried the pretrained model yet. Will try later.

Morganh · May 31, 2022, 3:02am

Are you using my spec file?

yingliu · July 6, 2022, 6:35am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

system · July 20, 2022, 6:36am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
