Well, now it seems to be PyTorch’s turn…
I tried to train my model using the ActionRecognitionNet sample code, and tao train did not complete successfully.
Train RGB only model with PTM
[NeMo W 2023-07-31 10:07:18 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 nemo_logging:349] /usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py:81: UserWarning:
'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
Created a temporary directory at /tmp/tmp57si20o_
Writing /tmp/tmp57si20o_/_remote_module_non_scriptable.py
loading trained weights from /content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['output_dir=/content/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=2']
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 371, in <lambda>
overrides=args.overrides,
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "<frozen cv.action_recognition.scripts.train>", line 77, in main
File "<frozen cv.action_recognition.scripts.train>", line 28, in run_experiment
File "<frozen cv.action_recognition.model.pl_ar_model>", line 33, in __init__
File "<frozen cv.action_recognition.model.pl_ar_model>", line 39, in _build_model
File "<frozen cv.action_recognition.model.build_nn_model>", line 82, in build_ar_model
File "<frozen cv.action_recognition.model.ar_model>", line 105, in get_basemodel3d
File "<frozen cv.action_recognition.model.resnet3d>", line 366, in resnet3d
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet3d:
size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "</usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py>", line 3, in <module>
File "<frozen cv.action_recognition.scripts.train>", line 81, in <module>
File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 103, in wrapper
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 368, in _run_hydra
lambda: hydra.run(
File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: 'str' object has no attribute 'decode'
Execution status: FAIL
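The two "size mismatch" lines look like the actual cause, so here is a small sketch that pulls the shapes out of them. It uses only the standard library; the names `find_mismatches` and `parse_shape` are made up for illustration and are not part of TAO or PyTorch.

```python
import re

# The two lines below are copied verbatim from the traceback above.
LOG = """\
size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).
"""

# Matches the "size mismatch" message format seen in the log above.
PATTERN = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def parse_shape(text):
    """Turn a string such as '5, 512' into the tuple (5, 512)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def find_mismatches(log):
    """Map each parameter name to (checkpoint shape, current model shape)."""
    return {
        m.group("name"): (parse_shape(m.group("ckpt")), parse_shape(m.group("model")))
        for m in PATTERN.finditer(log)
    }


mismatches = find_mismatches(LOG)
for name, (ckpt, model) in mismatches.items():
    print(f"{name}: checkpoint {ckpt} vs current model {model}")
```

Read this way, the checkpoint's fc_cls layer has 5 outputs, while the model was built with 2, which matches the override model_config.rgb_pretrained_num_classes=2 in the failing command. My assumption, not confirmed, is that this option is meant to describe the pretrained checkpoint (5 classes for resnet18_3d_rgb_hmdb5_32.tlt) rather than my own dataset, so setting it to 5 may be the fix.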