TAO toolkit 5.3 actionrecognitionnet training error for joint model, network shape mismatch

enloq · November 2, 2024, 5:15am

Please provide the following information when requesting support.

• Hardware : T4
• Network Type : ActionRecognitionNet
• Training spec file(If have, please share here)

results_dir: /results/joint_2d
encryption_key: nvidia_tao
model:
  model_type: joint
  backbone: resnet_18
  rgb_seq_length: 32
  rgb_pretrained_model_path: /workspace/tao-experiments/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_2d_rgb_hmdb5_32.tlt
  input_height: 224
  of_seq_length: 32
  of_pretrained_model_path: /workspace/tao-experiments/pretrained/resnet18_2d_of_hmdb5_32_a100.tlt
  input_width: 224
  input_type: 2d
  sample_strategy: consecutive
  dropout_ratio: 0.0
  of_pretrained_num_classes: 2
  rgb_pretrained_num_classes: 2
dataset:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    normal: 0
    shoplifting: 1
  batch_size: 16
  workers: 4
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False
train:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  num_epochs: 20
  checkpoint_interval: 1
evaluate:
  checkpoint: "??"
  test_dataset_dir: "??"
inference:
  checkpoint: "??"
  inference_dataset_dir: "??"
export:
  checkpoint: "??"
• How to reproduce the issue ?
(launcher) root:~/training$ tao model action_recognition train -e /specs/train_joint_2d.yaml -k nvidia_tao results_dir=/results/joint_2d
2024-11-02 04:15:47,317 [TAO Toolkit] [INFO] root 160:
2024-11-02 04:15:47,407 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container:
2024-11-02 04:15:47,435 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/contact/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-11-02 04:15:47,435 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
sys:1: UserWarning:
'train_joint_2d.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'train_joint_2d.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
loading trained weights from /workspace/tao-experiments/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_2d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['encryption_key=nvidia_tao', 'results_dir=/results/joint_2d']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 142, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 124, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 40, in run_experiment
    ar_model = ActionRecognitionModel(experiment_config)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/pl_ar_model.py", line 49, in __init__
    self._build_model(experiment_spec, export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/pl_ar_model.py", line 68, in _build_model
    self.model = build_ar_model(experiment_config=experiment_spec,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/build_nn_model.py", line 76, in build_ar_model
    model = JointModel(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/ar_model.py", line 225, in __init__
    self.model_rgb = get_basemodel(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/ar_model.py", line 110, in get_basemodel
    model = resnet2d(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/resnet.py", line 276, in resnet2d
    model.load_state_dict(model_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ResNet:
	size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
	size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-11-02 04:16:03,741 - TAO Toolkit - root - ERROR] Execution status: FAIL
2024-11-02 04:16:04,665 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

I tried to train the joint ActionRecognitionNet model to recognize 2 classes, but I get this shape mismatch error. I have successfully trained the RGB-only model. I also tried changing the number of pretrained classes in the config file to 5, but I got the same error.
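
For readers skimming the log above, the two `size mismatch` lines are the part that matters. The following is a minimal sketch, not part of TAO, that pulls the parameter name and both shapes out of such lines with a regular expression. The helper names `parse_size_mismatches` and `_to_shape` are hypothetical; the message format is taken from the error pasted in this post.

```python
import re

# Matches the "size mismatch" lines printed in the RuntimeError above.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def _to_shape(text):
    """Turn '5, 512' into (5, 512); hypothetical helper."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_size_mismatches(log_text):
    """Return (param_name, checkpoint_shape, model_shape) per mismatch line."""
    return [
        (m.group("name"), _to_shape(m.group("ckpt")), _to_shape(m.group("model")))
        for m in MISMATCH_RE.finditer(log_text)
    ]


# The two lines copied from the log in this post.
LOG = """\
size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).
"""

for name, ckpt_shape, model_shape in parse_size_mismatches(LOG):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

Run on the log above, this reports only `fc_cls.weight` and `fc_cls.bias`, with a leading dimension of 5 in the checkpoint and 2 in the current model.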

Morganh · November 2, 2024, 3:48pm

This happens because the pretrained model on NGC has 5 classes. Please train from scratch for your 2 classes.
A similar topic is "Some issues regarding running TAO Colab sample codes" (#4 by silentjcr).
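
To illustrate the reply above, here is a small sketch that compares parameter shapes between a checkpoint and a freshly built model to show which entries conflict. It is a generic illustration, not TAO code and not a supported way to load the NGC weights. The function name `split_by_shape` and the `backbone.layer.weight` entry are hypothetical; only the `fc_cls` shapes come from the error in this thread. With PyTorch, such a shape mapping could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def split_by_shape(checkpoint_shapes, model_shapes):
    """Split checkpoint parameter names into loadable and conflicting ones.

    Both arguments map a parameter name to its shape tuple. A name that is
    missing from the model is also reported as conflicting.
    """
    loadable, conflicting = [], []
    for name, ckpt_shape in checkpoint_shapes.items():
        if model_shapes.get(name) == ckpt_shape:
            loadable.append(name)
        else:
            conflicting.append(name)
    return loadable, conflicting


# fc_cls shapes are taken from the error message in this thread.
# "backbone.layer.weight" is a hypothetical stand-in for the backbone layers.
checkpoint_shapes = {
    "backbone.layer.weight": (512, 512),
    "fc_cls.weight": (5, 512),   # 5-class head in the NGC checkpoint
    "fc_cls.bias": (5,),
}
model_shapes = {
    "backbone.layer.weight": (512, 512),
    "fc_cls.weight": (2, 512),   # 2-class head built from the spec
    "fc_cls.bias": (2,),
}

loadable, conflicting = split_by_shape(checkpoint_shapes, model_shapes)
print("loadable:", loadable)
print("conflicting:", conflicting)
```

In this toy setup only the classification layer `fc_cls` conflicts, which matches the two parameters named in the RuntimeError.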

enloq · November 5, 2024, 6:42am

Thanks. That works. I am also curious whether you plan to support the joint or OF model on DeepStream. Right now it throws an error and does not seem to support optical flow at all. Is there a workaround?

Morganh · November 5, 2024, 9:48am

Refer to https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/action_recognition_net.html#deploying-the-actionrecognitionnet-in-the-deepstream-sample. This network supports the following input options: RGB-only input, optical flow (OF) only input, and two-stream joint input (RGB+OF).
Refer to the sample applications documentation for detailed steps to run action recognition in DeepStream.
If you run into an issue when running it in DeepStream, please create a topic in the DeepStream forum. Thanks.

yingliu · December 5, 2024, 2:39am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

system · December 19, 2024, 2:40am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
