TAO Toolkit 5.3 ActionRecognitionNet training error for joint model: network shape mismatch

Please provide the following information when requesting support.

• Hardware : T4
• Network Type : ActionRecognitionNet
• Training spec file(If have, please share here)

results_dir: /results/joint_2d
encryption_key: nvidia_tao
model:
  model_type: joint
  backbone: resnet_18
  rgb_seq_length: 32
  rgb_pretrained_model_path: /workspace/tao-experiments/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_2d_rgb_hmdb5_32.tlt
  input_height: 224
  of_seq_length: 32
  of_pretrained_model_path: /workspace/tao-experiments/pretrained/resnet18_2d_of_hmdb5_32_a100.tlt
  input_width: 224
  input_type: 2d
  sample_strategy: consecutive
  dropout_ratio: 0.0
  of_pretrained_num_classes: 2
  rgb_pretrained_num_classes: 2
dataset:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    normal: 0
    shoplifting: 1
  batch_size: 16
  workers: 4
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False
train:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  num_epochs: 20
  checkpoint_interval: 1
evaluate:
  checkpoint: "??"
  test_dataset_dir: "??"
inference:
  checkpoint: "??"
  inference_dataset_dir: "??"
export:
  checkpoint: "??"
• How to reproduce the issue ?
(launcher) root:~/training$ tao model action_recognition train -e /specs/train_joint_2d.yaml -k nvidia_tao results_dir=/results/joint_2d
2024-11-02 04:15:47,317 [TAO Toolkit] [INFO] root 160:
2024-11-02 04:15:47,407 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container:
2024-11-02 04:15:47,435 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/contact/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-11-02 04:15:47,435 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
sys:1: UserWarning:
'train_joint_2d.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'train_joint_2d.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
loading trained weights from /workspace/tao-experiments/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_2d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['encryption_key=nvidia_tao', 'results_dir=/results/joint_2d']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 142, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 124, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 40, in run_experiment
    ar_model = ActionRecognitionModel(experiment_config)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/pl_ar_model.py", line 49, in __init__
    self._build_model(experiment_spec, export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/pl_ar_model.py", line 68, in _build_model
    self.model = build_ar_model(experiment_config=experiment_spec,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/build_nn_model.py", line 76, in build_ar_model
    model = JointModel(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/ar_model.py", line 225, in __init__
    self.model_rgb = get_basemodel(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/ar_model.py", line 110, in get_basemodel
    model = resnet2d(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/resnet.py", line 276, in resnet2d
    model.load_state_dict(model_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ResNet:
    size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
    size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-11-02 04:16:03,741 - TAO Toolkit - root - ERROR] Execution status: FAIL
2024-11-02 04:16:04,665 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

I tried to train the joint ActionRecognitionNet model to recognize 2 classes, but I get this shape mismatch error. I have successfully trained the RGB-only model. I also tried changing the number of pretrained classes in the config file to 5, but I got the same error.

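The error occurs because the pretrained model on NGC has 5 classes while your model has 2, so the weights of the classification layer (fc_cls) cannot be loaded. Please train from scratch for your 2 classes.
A similar topic is "Some issues regarding running TAO Colab sample codes" (post #4 by silentjcr).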
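
For readers who want to see why the load fails, here is a minimal sketch of the shape comparison behind the "size mismatch" messages in the traceback. It is an illustration only: it uses plain Python dictionaries rather than PyTorch or TAO, the function name `find_shape_mismatches` is hypothetical, and the `backbone_layer.weight` entry is a made-up placeholder for a layer whose shape matches. Only the `fc_cls` shapes are taken from the error message.

```python
# Illustration only: mimics the per-parameter shape comparison that leads to
# the "size mismatch" messages in the traceback. Not PyTorch or TAO code.

def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return (name, checkpoint_shape, model_shape) for every parameter
    present in both dicts whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# fc_cls shapes are copied from the error message: the checkpoint head has
# 5 classes, the model being built has 2. "backbone_layer.weight" is a
# hypothetical placeholder for a layer that matches in both.
checkpoint_shapes = {
    "backbone_layer.weight": (64, 64, 3, 3),
    "fc_cls.weight": (5, 512),
    "fc_cls.bias": (5,),
}
model_shapes = {
    "backbone_layer.weight": (64, 64, 3, 3),
    "fc_cls.weight": (2, 512),
    "fc_cls.bias": (2,),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, ckpt_shape, model_shape in mismatches:
    print(f"size mismatch for {name}: checkpoint {ckpt_shape}, model {model_shape}")
```

Only the two `fc_cls` entries are reported, which matches the traceback: the backbone weights are compatible, and the conflict is confined to the final classification layer whose first dimension is the number of classes.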

Thanks. That works. I am also curious whether you plan to support the joint or OF model in DeepStream. Right now it throws an error and does not seem to support optical flow at all. Is there a workaround?

Refer to https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/action_recognition_net.html#deploying-the-actionrecognitionnet-in-the-deepstream-sample. This network supports the following input options: RGB-only input, optical flow (OF) only input, and two-stream joint input (RGB+OF).
Refer to the sample applications documentation for detailed steps to run action recognition in DeepStream.
If you run into an issue when running it in DeepStream, please create a topic in the DeepStream forum. Thanks.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.