Some issues regarding running TAO Colab sample codes

Well, now it seems to be PyTorch’s turn…
I tried to train my model using the ActionRecognitionNet sample code, and tao train didn’t complete successfully.

Train RGB only model with PTM
[NeMo W 2023-07-31 10:07:18 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 nemo_logging:349] /usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py:81: UserWarning: 
    'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp57si20o_
Writing /tmp/tmp57si20o_/_remote_module_non_scriptable.py
loading trained weights from /content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['output_dir=/content/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=2']
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 371, in <lambda>
    overrides=args.overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.action_recognition.scripts.train>", line 77, in main
  File "<frozen cv.action_recognition.scripts.train>", line 28, in run_experiment
  File "<frozen cv.action_recognition.model.pl_ar_model>", line 33, in __init__
  File "<frozen cv.action_recognition.model.pl_ar_model>", line 39, in _build_model
  File "<frozen cv.action_recognition.model.build_nn_model>", line 82, in build_ar_model
  File "<frozen cv.action_recognition.model.ar_model>", line 105, in get_basemodel3d
  File "<frozen cv.action_recognition.model.resnet3d>", line 366, in resnet3d
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet3d:
	size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
	size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py>", line 3, in <module>
  File "<frozen cv.action_recognition.scripts.train>", line 81, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 103, in wrapper
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 368, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: 'str' object has no attribute 'decode'
Execution status: FAIL

Can you share your training spec file?
Did you run the default Colab notebook?

train_rgb_3d_finetune.yaml (811 Bytes)

I only modified the label-related content, because I was trying to train a model with two classes, smoking and standing.

Please train from scratch, since the pretrained model on NGC was trained for 5 classes.

Oh, I’ll try it again.
I was expecting that I could use the pretrained model to train my own model with different classes by transfer learning like I did in the Multi-class Image Classification sample code, where I trained my 2-class model using the provided pretrained model.

I shouldn’t have changed this value from 5 to 2; that change is what caused the size-mismatch error shown above.

model_config.rgb_pretrained_num_classes=5
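
To illustrate why this value has to stay at 5, here is a minimal sketch of the shape check behind the error. It is not TAO's actual loading code: plain tuples stand in for tensors, the helper names `head_shapes` and `find_mismatches` are made up for illustration, and the shapes are copied from the error message. It assumes, based on the traceback, that rgb_pretrained_num_classes sets the head size of the model the checkpoint is loaded into, not the number of classes in the new dataset.

```python
# Shapes of the classifier head stored in the pretrained checkpoint,
# copied from the RuntimeError in the log (plain tuples stand in for tensors).
CHECKPOINT_SHAPES = {
    "fc_cls.weight": (5, 512),
    "fc_cls.bias": (5,),
}


def head_shapes(num_classes, feature_dim=512):
    """Hypothetical helper: head shapes for a model built with num_classes outputs."""
    return {
        "fc_cls.weight": (num_classes, feature_dim),
        "fc_cls.bias": (num_classes,),
    }


def find_mismatches(checkpoint_shapes, model_shapes):
    """Return the names of parameters whose shapes differ between checkpoint and model."""
    return sorted(
        name
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    )


# rgb_pretrained_num_classes=2: the receiving model has a 2-class head, so both
# head parameters disagree with the 5-class checkpoint.
print(find_mismatches(CHECKPOINT_SHAPES, head_shapes(2)))

# rgb_pretrained_num_classes=5: the head matches the checkpoint and nothing is reported.
print(find_mismatches(CHECKPOINT_SHAPES, head_shapes(5)))
```

With the value set to 2, both fc_cls.weight and fc_cls.bias are reported, which matches the two size mismatch lines in the traceback; with 5, the list is empty.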

Aside from that, I still had to run the following command before training would work.
!pip install transformers -U

Thanks for your help.

When you set model_config.rgb_pretrained_num_classes=5, training runs now, right?

Yes. I had modified it to 2 until I realized that I shouldn’t have done that.

I’m now trying resnet2d to train another model and it’s working.

I may still try to figure out whether it’s possible to change the backbone network from resnet to something like mobilenet_v1 for this task.

For the supported backbones, please refer to the ActionRecognitionNet page in the NVIDIA Docs.
