Retraining GestureNet


• Hardware (Ubuntu 18.04 PC with NVIDIA Quadro K610M)
• Network Type (GestureNet)

Hello,
I am trying to retrain the GestureNet model with one additional class, the Y_P gesture from the HGR dataset, but I can’t get the model to train well (the cross-entropy loss stays around 10). I am doing everything inside the Jupyter notebook for GestureNet.

• How to reproduce the issue?
What I did, step by step:

  1. In convert_hgr_to_tlt_data.py I set "Y_P": "rock" so that my dataset includes the new class rock, then ran the script to create the TLT-format dataset.
  2. In specs/dataset_experiment_config.json I changed class_weights to:
    "class_weights": {
        "random": 0.16,
        "thumbs_up": 0.16,
        "two": 0.16,
        "stop": 0.10,
        "ok": 0.10,
        "fist": 0.16,
        "rock": 0.16
    }
  3. I then ran the command !tao gesturenet dataset_convert
  4. In specs/train_spec.json I changed add_new_head, classes and num_classes:
    "add_new_head": true
    "classes": {
        "thumbs_up": 0,
        "fist": 1,
        "stop": 2,
        "ok": 3,
        "two": 4,
        "random": 5,
        "rock": 6
    },
    "num_classes": 7
  5. I then ran !tao gesturenet train -e $SPECS_DIR/train_spec.json -k $KEY. The problem is that the loss stays around 10 and does not decrease. As a result, evaluating the trained model gives very poor metrics:

pred_label   fist  random    two
gt_label
fist         22.2    77.8    0.0
ok           8.33    91.7    0.0
random       34.1    65.9    0.0
rock          0.0   1e+02    0.0
stop        1e+02     0.0    0.0
thumbs_up    60.0     0.0   40.0
two           0.0   1e+02    0.0
All          34.1    63.6   2.27
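
The edits to class_weights, classes and num_classes described above have to agree with each other. As a quick sanity check, here is a minimal Python sketch that compares the three, using the values from this post. The helper check_spec is hypothetical and not part of the TAO Toolkit, and printing the sum of the weights is only informational, since the thread does not say whether the weights must sum to 1.

```python
# Values copied from the post above. check_spec is a hypothetical helper,
# not part of the TAO Toolkit.
class_weights = {
    "random": 0.16, "thumbs_up": 0.16, "two": 0.16, "stop": 0.10,
    "ok": 0.10, "fist": 0.16, "rock": 0.16,
}
classes = {
    "thumbs_up": 0, "fist": 1, "stop": 2, "ok": 3,
    "two": 4, "random": 5, "rock": 6,
}
num_classes = 7


def check_spec(classes, class_weights, num_classes):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if len(classes) != num_classes:
        problems.append(f"num_classes is {num_classes} but {len(classes)} classes are listed")
    if sorted(classes.values()) != list(range(len(classes))):
        problems.append("class indices are not contiguous starting at 0")
    missing = set(classes) - set(class_weights)
    extra = set(class_weights) - set(classes)
    if missing:
        problems.append(f"classes without a weight: {sorted(missing)}")
    if extra:
        problems.append(f"weights for unknown classes: {sorted(extra)}")
    return problems


problems = check_spec(classes, class_weights, num_classes)
weight_sum = sum(class_weights.values())
print(problems or "spec is consistent")
print(f"sum of class weights: {weight_sum:.2f}")
```

For the values in this post the check finds no mismatch, which suggests the cause lies elsewhere than in these three fields.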

I also tried with the unmodified dataset and it works well. However, when I set add_new_head to true in the train_spec file, the model gets worse. Could someone explain what add_new_head actually does?

Feel free to ask if you need more information. Thanks in advance.
Nicolas

Output of !tao gesturenet train -e $SPECS_DIR/train_spec.json -k $KEY:

Epoch 1/50
1/434 […] - ETA: 2:58 - loss: 16.2064 - categorical_accuracy: 0.0000e+00
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.366803). Check your callbacks.
  % delta_t_median)
2/434 […] - ETA: 2:57 - loss: 16.2064 - categorical_accuracy: 0.0000e+00
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.183856). Check your callbacks.
  % delta_t_median)
434/434 [==============================] - 27s 61ms/step - loss: 9.9892 - categorical_accuracy: 0.1659 - val_loss: 10.5450 - val_categorical_accuracy: 0.1643
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-05-19 08:31:13,069 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

Epoch 2/50
434/434 [==============================] - 26s 59ms/step - loss: 10.4679 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 3/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 4/50
434/434 [==============================] - 24s 55ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 5/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 6/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 7/50
434/434 [==============================] - 24s 55ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 8/50
434/434 [==============================] - 24s 56ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 9/50
434/434 [==============================] - 25s 59ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 10/50
434/434 [==============================] - 25s 57ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 11/50
434/434 [==============================] - 23s 53ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
Epoch 12/50
434/434 [==============================] - 25s 58ms/step - loss: 10.4499 - categorical_accuracy: 0.1659 - val_loss: 7.0275 - val_categorical_accuracy: 0.1643
2022-05-19 08:35:47,521 [INFO] driveix.classifynet.trainer.classifynet_trainer: Finished training full model
2022-05-19 08:35:55,156 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val Loss: 7.027486324310303
2022-05-19 08:35:55,156 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val accuracy: 0.16428571939468384
2022-05-19 08:35:55,156 [INFO] main: Training finished successfully.
2022-05-19 10:35:56,505 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Set add_new_head to false only if you want to fine-tune on a dataset with the same gestures as the pretrained model.

add_new_head (bool) is a flag that controls whether a new head is added. If it is set to true, the last layer is removed, the pretrained weights are loaded, and a new head is then added.
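
As an illustration of the description above, here is a minimal pure-Python sketch of what replacing the head amounts to: the backbone weights are kept, the last layer (sized for the old classes) is dropped, and a freshly initialized layer sized for the new classes is attached. The layer names, the feature width of 512 and the helper functions are hypothetical; none of this is taken from the TAO source.

```python
import random


def make_dense(n_in, n_out, rng):
    """Randomly initialized weight matrix of shape (n_in, n_out)."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_out)] for _ in range(n_in)]


def add_new_head(pretrained, num_classes, rng):
    """Keep every layer except the last, then append a new classification layer."""
    *backbone, (_, old_head) = list(pretrained.items())
    n_features = len(old_head)   # input width of the dropped head
    model = dict(backbone)       # pretrained weights are reused as they are
    model["new_head"] = make_dense(n_features, num_classes, rng)
    return model


rng = random.Random(108)
pretrained = {
    "backbone": make_dense(8, 512, rng),  # stand-in for the ResNet body
    "head": make_dense(512, 6, rng),      # original 6-gesture classifier
}
model = add_new_head(pretrained, num_classes=7, rng=rng)
print({name: (len(w), len(w[0])) for name, w in model.items()})
# {'backbone': (8, 512), 'new_head': (512, 7)}
```

Because the new head starts from random values, its outputs carry no information until it has been trained.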

From the log, the loss appears to stay constant during training.
Did you see this behavior when you “tried with the dataset without changes and it works well”?
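
Two reference values may help when reading the loss in this log. A classifier that outputs a uniform distribution over N classes has a cross-entropy of ln N, and Keras clips predicted probabilities at a small epsilon, so a sample whose true class receives (almost) zero probability contributes about -ln(epsilon). The sketch below computes both, assuming the default Keras epsilon of 1e-7. The loss reported by the trainer may also include the L2 regularization term and the class weights, so these are rough reference points only.

```python
import math


def chance_level_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def saturated_cross_entropy(epsilon=1e-7):
    """Per-sample loss when the true class probability is clipped to epsilon.

    epsilon=1e-7 is the Keras default and is an assumption about this setup.
    """
    return -math.log(epsilon)


print(f"uniform guess, 7 classes: {chance_level_cross_entropy(7):.4f}")  # 1.9459
print(f"fully wrong, clipped:     {saturated_cross_entropy():.4f}")      # 16.1181
```

The first-batch loss of 16.2064 is close to the clipped value, which would be consistent with a saturated softmax output, and the plateau near 10 is far above the uniform-guess value of about 1.95.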

Thank you, Morgan, for the explanation of add_new_head; that’s what I thought.
No, when training on the unmodified dataset the loss decreased epoch after epoch, as expected. I really don’t know where this problem could come from.
Let me know if I can send any other information that would help solve it.

Best regards,
Nicolas

To narrow this down, could you remove the pretrained model from the spec file and run the training again?

I don’t understand what you mean by “removing the pretrained model in the spec file”. Do you mean removing the weights from the train_spec file?

What I did was train the base model with the base weights (gesturenet_vtrainable_v1.0/model.tlt) on the base dataset.
The only parameter I changed from the default config is add_new_head: true.

The training went well, here are the last epochs of the training:

Epoch 40/50
453/453 [==============================] - 29s 63ms/step - loss: 2.2338 - categorical_accuracy: 0.1634 - val_loss: 1.9925 - val_categorical_accuracy: 0.1951
Epoch 41/50
453/453 [==============================] - 28s 62ms/step - loss: 2.2316 - categorical_accuracy: 0.1634 - val_loss: 1.9915 - val_categorical_accuracy: 0.1951
Epoch 42/50
453/453 [==============================] - 28s 61ms/step - loss: 2.2297 - categorical_accuracy: 0.1634 - val_loss: 1.9905 - val_categorical_accuracy: 0.1951
Epoch 43/50
453/453 [==============================] - 28s 63ms/step - loss: 2.2287 - categorical_accuracy: 0.1634 - val_loss: 1.9895 - val_categorical_accuracy: 0.1951
Epoch 44/50
453/453 [==============================] - 28s 62ms/step - loss: 2.2287 - categorical_accuracy: 0.1634 - val_loss: 1.9886 - val_categorical_accuracy: 0.1951
Epoch 45/50
453/453 [==============================] - 29s 63ms/step - loss: 2.2260 - categorical_accuracy: 0.1634 - val_loss: 1.9878 - val_categorical_accuracy: 0.1951
Epoch 46/50
453/453 [==============================] - 28s 62ms/step - loss: 2.2228 - categorical_accuracy: 0.1634 - val_loss: 1.9870 - val_categorical_accuracy: 0.1951
Epoch 47/50
453/453 [==============================] - 28s 61ms/step - loss: 2.2230 - categorical_accuracy: 0.1634 - val_loss: 1.9862 - val_categorical_accuracy: 0.1951
Epoch 48/50
453/453 [==============================] - 28s 61ms/step - loss: 2.2217 - categorical_accuracy: 0.1634 - val_loss: 1.9855 - val_categorical_accuracy: 0.1951
Epoch 49/50
453/453 [==============================] - 27s 61ms/step - loss: 2.2213 - categorical_accuracy: 0.1634 - val_loss: 1.9848 - val_categorical_accuracy: 0.1951
Epoch 50/50
453/453 [==============================] - 29s 63ms/step - loss: 2.2208 - categorical_accuracy: 0.1634 - val_loss: 1.9841 - val_categorical_accuracy: 0.1951

However, a problem appears during evaluation: the model predicts only the class ‘stop’ for all images:

2022-05-23 06:57:26,891 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: —Confusion Matrix—
2022-05-23 06:57:26,891 [INFO] driveix.classifynet.evaluator.classifynet_evaluator:
pred_label   stop
gt_label
fist        1e+02
ok          1e+02
random      1e+02
stop        1e+02
thumbs_up   1e+02
two         1e+02
All         1e+02

2022-05-23 06:57:27,135 [INFO] driveix.classifynet.evaluator.classifynet_evaluator:
              precision    recall  f1-score   support

        fist     0.0000    0.0000    0.0000         9
          ok     0.0000    0.0000    0.0000        12
      random     0.0000    0.0000    0.0000        49
        stop     0.1023    1.0000    0.1856         9
   thumbs_up     0.0000    0.0000    0.0000         5
         two     0.0000    0.0000    0.0000         4

    accuracy                         0.1023        88
   macro avg     0.0170    0.1667    0.0309        88
weighted avg     0.0105    0.1023    0.0190        88
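
The report above is what one would expect from a model that predicts ‘stop’ for every image. A short check in plain Python, using only the support counts from the report (the helper name constant_prediction_report is hypothetical), reproduces the same numbers:

```python
# Support counts copied from the report above.
support = {"fist": 9, "ok": 12, "random": 49, "stop": 9, "thumbs_up": 5, "two": 4}


def constant_prediction_report(support, predicted):
    """Per-class (precision, recall, f1) when every sample is labelled `predicted`."""
    total = sum(support.values())
    rows = {}
    for label, count in support.items():
        if label == predicted:
            precision = count / total  # only `count` of all predictions are right
            recall = 1.0               # every true sample of this class is found
            f1 = 2 * precision * recall / (precision + recall)
        else:
            precision = recall = f1 = 0.0
        rows[label] = (precision, recall, f1)
    return rows, total


rows, total = constant_prediction_report(support, "stop")
accuracy = support["stop"] / total
macro = [sum(row[i] for row in rows.values()) / len(rows) for i in range(3)]
weighted = [sum(rows[label][i] * support[label] for label in support) / total
            for i in range(3)]

print(f"accuracy     {accuracy:.4f}")
print("macro avg    " + " ".join(f"{value:.4f}" for value in macro))
print("weighted avg " + " ".join(f"{value:.4f}" for value in weighted))
```

The accuracy of 0.1023 is simply the share of ‘stop’ images in the evaluation set (9 of 88), so it says nothing about whether the model has learned to tell the gestures apart.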

The model doesn’t have this problem when add_new_head is set to False. Do you have any clue why?

Best regards,
Nicolas

Yes.

The last training log is from a run without pretrained weights, right?

No, the previous logs were from a training run with the pretrained weights (gesturenet_vtrainable_v1.0/model.tlt).

Here are the logs from a training run without any weights specified in the train_spec file:

2022-05-23 07:40:01,042 [INFO] driveix.classifynet.models.resnet_vanilla: Model loaded with random weight initilization.

Epoch 1/50
2/453 […] - ETA: 7:34 - loss: 7.4780 - categorical_accuracy: 0.0000e+00
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.420138). Check your callbacks.
  % delta_t_median)
453/453 [==============================] - 63s 139ms/step - loss: 6.9864 - categorical_accuracy: 0.1810 - val_loss: 6.9593 - val_categorical_accuracy: 0.1382
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-05-23 07:41:07,498 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

Epoch 2/50
453/453 [==============================] - 59s 131ms/step - loss: 6.9783 - categorical_accuracy: 0.1656 - val_loss: 6.9590 - val_categorical_accuracy: 0.1382
Epoch 3/50
453/453 [==============================] - 61s 135ms/step - loss: 6.9825 - categorical_accuracy: 0.1369 - val_loss: 6.9590 - val_categorical_accuracy: 0.1545
Epoch 4/50
453/453 [==============================] - 61s 134ms/step - loss: 6.9759 - categorical_accuracy: 0.1567 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
Epoch 5/50
453/453 [==============================] - 59s 131ms/step - loss: 6.9789 - categorical_accuracy: 0.1391 - val_loss: 6.9590 - val_categorical_accuracy: 0.1545
Epoch 6/50
453/453 [==============================] - 59s 130ms/step - loss: 6.9677 - categorical_accuracy: 0.1788 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
Epoch 7/50
453/453 [==============================] - 59s 130ms/step - loss: 6.9690 - categorical_accuracy: 0.1876 - val_loss: 6.9590 - val_categorical_accuracy: 0.1463
Epoch 8/50
453/453 [==============================] - 56s 125ms/step - loss: 6.9724 - categorical_accuracy: 0.1479 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
Epoch 9/50
453/453 [==============================] - 60s 134ms/step - loss: 6.9694 - categorical_accuracy: 0.1678 - val_loss: 6.9590 - val_categorical_accuracy: 0.1545
Epoch 10/50
453/453 [==============================] - 60s 133ms/step - loss: 6.9733 - categorical_accuracy: 0.1545 - val_loss: 6.9588 - val_categorical_accuracy: 0.1626
Epoch 11/50
453/453 [==============================] - 60s 132ms/step - loss: 6.9667 - categorical_accuracy: 0.1832 - val_loss: 6.9589 - val_categorical_accuracy: 0.1545
2022-05-23 07:51:19,588 [INFO] driveix.classifynet.trainer.classifynet_trainer: Finished training full model
2022-05-23 07:51:33,892 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val Loss: 6.958917617797852
2022-05-23 07:51:33,892 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val accuracy: 0.15447154641151428
2022-05-23 07:51:33,893 [INFO] main: Training finished successfully.
2022-05-23 09:51:36,301 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I still have the same problem:

2022-05-23 07:55:05,355 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: Calculating per-class P/R and confusion matrix.
2022-05-23 07:55:13,053 [INFO] driveix.classifynet.evaluator.classifynet_evaluator: —Confusion Matrix—
2022-05-23 07:55:13,053 [INFO] driveix.classifynet.evaluator.classifynet_evaluator:
pred_label  thumbs_up
gt_label
fist            1e+02
ok              1e+02
random          1e+02
stop            1e+02
thumbs_up       1e+02
two             1e+02
All             1e+02

Thanks for the quick answer.

Please share the full training spec you use when training with 7 classes.
I will try to reproduce on my side.

My train_spec.json file:

{
    "random_seed": 108,
    "batch_size": 1,
    "output_experiments_fld": "/workspace/tao-experiments/gesturenet/",
    "save_weights_path": "model",
    "trainer": {
        "class": "ClassifyNetTrainer",
        "module": "driveix.classifynet.trainer.classifynet_trainer",
        "top_training": {
            "stage_order": 1,
            "loss_fn": "categorical_crossentropy",
            "train_epochs": 0,
            "num_layers_unfreeze": 0,
            "optimizer": "rmsprop"
        },
        "finetuning": {
            "stage_order": 2,
            "train_epochs": 50,
            "loss_fn": "categorical_crossentropy",
            "initial_lrate": 5e-05,
            "decay_step_size": 33,
            "lr_drop_rate": 0.5,
            "enable_checkpointing": true,
            "num_layers_unfreeze": 3,
            "optimizer": "sgd"
        },
        "num_workers": 1
    },
    "model": {
        "image_height": 160,
        "image_width": 160,
        "gray_scale_input": false,
        "data_format": "channels_last",
        "base_model": "resnet_vanilla",
        "num_layers": 18,
        "use_batch_norm": true,
        "weights_init": "/workspace/tao-experiments/gesturenet/pretrained_models/gesturenet_vtrainable_v1.0/model.tlt",
        "add_new_head": true,
        "kernel_regularizer_type": "l2",
        "kernel_regularization_factor": 0.001
    },
    "dataset": {
        "image_root_path": "/workspace/tao-experiments/gesturenet/",
        "classes": {
            "thumbs_up": 0,
            "fist": 1,
            "stop": 2,
            "ok": 3,
            "two": 4,
            "random": 5,
            "rock": 6
        },
        "data_path": "/workspace/tao-experiments/gesturenet/data.json",
        "num_classes": 7,
        "augmentation": {
            "shear_range": 0.0,
            "color_pca_aug": {
                "enable": false,
                "probability": 0.5
            },
            "gamma_aug": {
                "enable": true,
                "probability": 0.5,
                "lower_limit": 0.5,
                "upper_limit": 2.0
            },
            "rotation_range": 5,
            "brightness_range": [
                0.5,
                1.5
            ],
            "occlusion_aug": {
                "max_aspect_ratio": 3.33,
                "max_area": 0.25,
                "enable": true,
                "probability": 0.5,
                "pixel_level": true,
                "min_area": 0.05,
                "min_pixel": 0,
                "max_pixel": 255,
                "min_aspect_ratio": 0.3
            },
            "horizontal_flip": true
        }
    },
    "evaluator": {
        "evaluation_exp_name": "results",
        "data_path": "/workspace/tao-experiments/gesturenet/data.json"
    }
}
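
To make the finetuning settings in this spec concrete, here is a small sketch of the learning rate per epoch. It assumes that initial_lrate, decay_step_size and lr_drop_rate describe a step decay, i.e. the rate is multiplied by lr_drop_rate once every decay_step_size epochs. That reading is an assumption based on the field names and has not been checked against the TAO implementation; step_decay_lr is a hypothetical helper.

```python
def step_decay_lr(epoch, initial_lrate=5e-05, decay_step_size=33, lr_drop_rate=0.5):
    """Learning rate at a 0-based epoch under the ASSUMED step-decay rule."""
    return initial_lrate * lr_drop_rate ** (epoch // decay_step_size)


for epoch in (0, 11, 32, 33, 49):
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(epoch):.2e}")
```

Under this assumption the rate stays at 5e-05 for the first 33 epochs, so a run that ends after about a dozen epochs, like the 7-class run posted earlier, never reaches a decay step.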

If you want to use the same dataset as I did (with the new class rock) please refer to the first message I posted describing how I did it.
Thanks Morgan.

Hello Morgan,
Did you manage to reproduce the problem on your computer?
Best regards,
Nicolas

Sorry, I have not tried yet. Will run it later.

No problem, thank you for your responsiveness. Keep me informed.

With your steps, I can reproduce your result. Please modify your spec along the lines of the one below.

{
    "random_seed": 108,
    "batch_size": 2,
    "output_experiments_fld": "/workspace/demo_3.0/forum_repro/gesturenet/",
    "save_weights_path": "model",
    "trainer": {
        "class": "ClassifyNetTrainer",
        "module": "driveix.classifynet.trainer.classifynet_trainer",
        "top_training": {
            "stage_order": 1,
            "loss_fn": "categorical_crossentropy",
            "train_epochs": 1,
            "num_layers_unfreeze": 11,
            "optimizer": "rmsprop"
        },
        "finetuning": {
            "stage_order": 2,
            "train_epochs": 500,
            "loss_fn": "categorical_crossentropy",
            "initial_lrate": 5e-05,
            "decay_step_size": 33,
            "lr_drop_rate": 0.5,
            "enable_checkpointing": true,
            "num_layers_unfreeze": 100,
            "optimizer": "sgd"
        },
        "num_workers": 1
    },
    "model": {
        "image_height": 160,
        "image_width": 160,
        "gray_scale_input": false,
        "data_format": "channels_first",
        "base_model": "resnet_vanilla",
        "num_layers": 34,
        "use_batch_norm": false,
        "weights_init": "",
        "add_new_head": false,
        "kernel_regularizer_type": "l2",
        "kernel_regularization_factor": 0.001
    },
    "dataset": {
        "image_root_path": "/workspace/demo_3.0/forum_repro/gesturenet",
        "classes": {
            "thumbs_up": 0,
            "fist": 1,
            "stop": 2,
            "ok": 3,
            "two": 4,
            "random": 5,
            "rock": 6
        },
        "data_path": "/workspace/demo_3.0/forum_repro/gesturenet/data.json",
        "num_classes": 7,
        "augmentation": {
            "shear_range": 0.0,
            "color_pca_aug": {
                "enable": false,
                "probability": 0.5
            },
            "gamma_aug": {
                "enable": true,
                "probability": 0.5,
                "lower_limit": 0.5,
                "upper_limit": 2.0
            },
            "rotation_range": 5,
            "brightness_range": [
                0.5,
                1.5
            ],
            "occlusion_aug": {
                "max_aspect_ratio": 3.33,
                "max_area": 0.25,
                "enable": true,
                "probability": 0.5,
                "pixel_level": true,
                "min_area": 0.05,
                "min_pixel": 0,
                "max_pixel": 255,
                "min_aspect_ratio": 0.3
            },
            "horizontal_flip": true
        }
    },
    "evaluator": {
        "evaluation_exp_name": "results",
        "data_path": "/workspace/demo_3.0/forum_repro/gesturenet/data.json"
    }
}

Result: The val accuracy will reach about 0.7284.

loss: 9.7489 - categorical_accuracy: 0.6146 - val_loss: 9.5982 - val_categorical_accuracy: 0.7284
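
For readers comparing the two specs in this thread, the following sketch lists the settings that differ. It uses a generic nested-dictionary diff over an excerpt of the trainer and model sections only (paths in the other sections also differ). The helpers flatten and diff_specs are hypothetical and not part of the TAO Toolkit.

```python
def flatten(spec, prefix=""):
    """Turn a nested dict into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_specs(old, new):
    """Map each differing key path to its (old, new) pair of values."""
    a, b = flatten(old), flatten(new)
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)}


# Excerpts of the two specs posted in this thread.
original = {
    "batch_size": 1,
    "trainer": {
        "top_training": {"train_epochs": 0, "num_layers_unfreeze": 0},
        "finetuning": {"train_epochs": 50, "num_layers_unfreeze": 3,
                       "initial_lrate": 5e-05},
    },
    "model": {
        "data_format": "channels_last", "num_layers": 18, "use_batch_norm": True,
        "weights_init": "/workspace/tao-experiments/gesturenet/pretrained_models/"
                        "gesturenet_vtrainable_v1.0/model.tlt",
        "add_new_head": True,
    },
}
suggested = {
    "batch_size": 2,
    "trainer": {
        "top_training": {"train_epochs": 1, "num_layers_unfreeze": 11},
        "finetuning": {"train_epochs": 500, "num_layers_unfreeze": 100,
                       "initial_lrate": 5e-05},
    },
    "model": {
        "data_format": "channels_first", "num_layers": 34, "use_batch_norm": False,
        "weights_init": "", "add_new_head": False,
    },
}

changes = diff_specs(original, suggested)
for path, (old, new) in changes.items():
    print(f"{path}: {old!r} -> {new!r}")
```

Within this excerpt the suggested spec changes ten settings at once, including the network depth, batch normalization and the pretrained weights, so its result does not by itself isolate the effect of add_new_head.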

Thanks for the help, but I still have the problem of the loss staying constant. I will try it as soon as possible on another machine with more compute capability.
However, I still have a question: why does loading pretrained weights and then adding a new head not work properly? In the spec file you provided, add_new_head is set to false and no weights are provided.

I have not tried the pretrained model yet. Will try later.

Are you using my spec file?

There has been no update from you for a while, so we assume this is no longer an issue
and are closing this topic. If you need further support, please open a new one.
Thanks
