Some issues regarding running TAO Colab sample codes

silentjcr · July 31, 2023, 4:19pm

Well, now it seems to be PyTorch’s turn…
I tried to train my model using the ActionRecognitionNet sample code, and tao train didn’t complete successfully.

Train RGB only model with PTM
[NeMo W 2023-07-31 10:07:18 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 nemo_logging:349] /usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py:81: UserWarning: 
    'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp57si20o_
Writing /tmp/tmp57si20o_/_remote_module_non_scriptable.py
loading trained weights from /content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['output_dir=/content/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=2']
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 371, in <lambda>
    overrides=args.overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.action_recognition.scripts.train>", line 77, in main
  File "<frozen cv.action_recognition.scripts.train>", line 28, in run_experiment
  File "<frozen cv.action_recognition.model.pl_ar_model>", line 33, in __init__
  File "<frozen cv.action_recognition.model.pl_ar_model>", line 39, in _build_model
  File "<frozen cv.action_recognition.model.build_nn_model>", line 82, in build_ar_model
  File "<frozen cv.action_recognition.model.ar_model>", line 105, in get_basemodel3d
  File "<frozen cv.action_recognition.model.resnet3d>", line 366, in resnet3d
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet3d:
	size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
	size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py>", line 3, in <module>
  File "<frozen cv.action_recognition.scripts.train>", line 81, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 103, in wrapper
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 368, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: 'str' object has no attribute 'decode'
Execution status: FAIL

Morganh · August 1, 2023, 2:57am

Can you share your training spec file?
Did you run the default Colab notebook?

silentjcr · August 1, 2023, 3:05am

train_rgb_3d_finetune.yaml (811 Bytes)

I only modified the label-related content, since I’m trying to train a model with two classes, smoking and standing.

Morganh · August 1, 2023, 3:38am

silentjcr:

RuntimeError: Error(s) in loading state_dict for ResNet3d:
	size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
	size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).

Please train from scratch, since the pretrained model on NGC was trained for 5 classes.

silentjcr · August 1, 2023, 3:48am

Oh, I’ll try it again.
I was expecting to be able to use the pretrained model to train my own model with different classes via transfer learning. That is what I did in the Multi-class Image Classification sample code, where I trained my 2-class model using the provided pretrained model.

silentjcr · August 1, 2023, 4:24am

I shouldn’t have changed this value from 5 to 2; that change is what caused the error message shown above.

model_config.rgb_pretrained_num_classes=5
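
To illustrate why this value matters, here is a minimal, torch-free sketch of the shape check behind the error. It assumes the classification head is a linear layer named fc_cls with 512 input features, as the error message suggests. The helpers head_shapes and find_size_mismatches are hypothetical names made up for this illustration and are not part of TAO or PyTorch.

```python
# Hypothetical illustration: shapes are plain tuples, not real tensors.
FEATURE_DIM = 512  # input width of fc_cls, taken from the error message


def head_shapes(num_classes, feature_dim=FEATURE_DIM):
    """Parameter shapes of a linear classification head named fc_cls."""
    return {
        "fc_cls.weight": (num_classes, feature_dim),
        "fc_cls.bias": (num_classes,),
    }


def find_size_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for shared params whose shapes differ."""
    return [
        (name, ckpt_shape, model_shapes[name])
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != ckpt_shape
    ]


# The pretrained checkpoint was trained on 5 classes.
checkpoint = head_shapes(5)

# Head built with rgb_pretrained_num_classes=2 (the failing setting).
wrong = find_size_mismatches(checkpoint, head_shapes(2))

# Head built with rgb_pretrained_num_classes=5 (the working setting).
right = find_size_mismatches(checkpoint, head_shapes(5))

for name, ckpt_shape, model_shape in wrong:
    print(f"size mismatch for {name}: checkpoint {ckpt_shape}, model {model_shape}")
print("mismatches with rgb_pretrained_num_classes=5:", len(right))
```

Under these assumptions, the value has to describe the checkpoint being loaded (5 classes), not the new dataset (2 classes); otherwise the head shapes cannot line up.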

Aside from that, I still had to run the following command before training would work.
!pip install transformers -U

Thanks for your help.

Morganh · August 1, 2023, 5:11am

So with model_config.rgb_pretrained_num_classes=5, training runs now, right?

silentjcr · August 1, 2023, 6:06am

Yes. I had changed it to 2 before I realized that I shouldn’t have.

I’m now trying resnet2d to train another model and it’s working.

I may still try to figure out whether it’s possible to change the backbone network from resnet to something like mobilenet_v1 for this task.

Morganh · August 1, 2023, 6:10am

For the backbone, please refer to the ActionRecognitionNet page in the NVIDIA Docs.

system · August 15, 2023, 6:10am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
