Error when training a model using TAO for a custom action recognition net

• Hardware (T4/V100/Xavier/Nano/etc)
Tesla T4

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
action_recognition_net

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023

• Training spec file (if you have one, please share it here)

train_rgb_3d_finetune.yaml:

output_dir: /results/rgb_3d_ptm
encryption_key: nvidia_tao
model_config:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 3
  input_type: 3d
  sample_strategy: consecutive
  dropout_ratio: 0.0
train_config:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  epochs: 20
  checkpoint_interval: 1
dataset_config:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    throw: 0
  output_shape:
  - 224
  - 224
  batch_size: 32
  workers: 8
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

CLI:

print("Train RGB only model with PTM")
!tao action_recognition train \
                  -e $SPECS_DIR/train_rgb_3d_finetune.yaml \
                  -r $RESULTS_DIR/rgb_3d_ptm \
                  -k $KEY \
                  model_config.rgb_pretrained_model_path=$RESULTS_DIR/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt  \
                  model_config.rgb_pretrained_num_classes=5

I tried to train a custom action recognition model for the ‘throw’ action, but it fails with the error shown below:

Error Log:

Train RGB only model with PTM
2023-05-23 11:11:12,944 [INFO] root: Registry: ['nvcr.io']
2023-05-23 11:11:12,998 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-05-23 11:11:26 nemo_logging:349] <frozen cv.action_recognition.scripts.train>:81: UserWarning: 
    'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmpuro6y_xl
Writing /tmp/tmpuro6y_xl/_remote_module_non_scriptable.py
loading trained weights from /results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
ResNet3d(
  (conv1): Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)
  (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool3d(kernel_size=(1, 3, 3), stride=2, padding=(0, 1, 1), dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(64, 128, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(128, 256, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(128, 256, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(256, 512, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(256, 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avg_pool): AdaptiveAvgPool3d(output_size=(1, 1, 1))
  (fc_cls): Linear(in_features=512, out_features=1, bias=True)
)
[NeMo W 2023-05-23 11:11:31 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:151: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=False)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=False)`.
      rank_zero_deprecation(
    
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/rgb_3d_ptm/lightning_logs
Train dataset samples: 70
Validation dataset samples: 30
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Adjusting learning rate of group 0 to 1.0000e-03.

  | Name           | Type     | Params
--------------------------------------------
0 | model          | ResNet3d | 33.2 M
1 | train_accuracy | Accuracy | 0     
2 | val_accuracy   | Accuracy | 0     
--------------------------------------------
33.2 M    Trainable params
0         Non-trainable params
33.2 M    Total params
132.742   Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s][NeMo W 2023-05-23 11:11:31 nemo_logging:349] <frozen cv.action_recognition.dataloader.frame_sampler>:63: RuntimeWarning: divide by zero encountered in remainder
    
Error executing job with overrides: ['output_dir=/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=5']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py>", line 3, in <module>
  File "<frozen cv.action_recognition.scripts.train>", line 81, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.action_recognition.scripts.train>", line 75, in main
  File "<frozen cv.action_recognition.scripts.train>", line 64, in run_experiment
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1346, in _run_train
    self._run_sanity_check()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1414, in _run_sanity_check
    val_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 153, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 111, in advance
    batch = next(data_fetcher)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
    batch = next(iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1374, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1400, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<frozen cv.action_recognition.dataloader.ar_dataset>", line 239, in __getitem__
  File "<frozen cv.action_recognition.dataloader.ar_dataset>", line 203, in get_frames
IndexError: list index out of range

Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-05-23 11:11:36,292 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please check your dataset folder. Refer to the similar error and its solution in the topic Error while training ActionRecognitionNet with TAO.
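For reference, the processed dataset is expected to look roughly like /data/train/<class>/<clip>/rgb/<frame>.png, i.e. one folder per class, one folder per clip, and the extracted frames in an rgb subfolder (this mirrors the HMDB sample used by the notebook; the frame names themselves do not matter). A quick, hedged check you can run on the host to count frames per clip, since an empty rgb folder is a typical cause of the “list index out of range” error above:

# Sketch only: count extracted frames per clip. Paths assume the default
# notebook mounts ($HOST_DATA_DIR/train maps to /data/train in the container);
# adjust to your own setup.
find "$HOST_DATA_DIR/train" -type d -name rgb | while read -r d; do
    echo "$d: $(ls "$d" | wc -l) frames"
done

Any clip that reports 0 frames (or has no rgb folder at all) will break the dataloader.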

Thanks, it is indeed my dataset problem. But why is the command below not processing the raw videos into images properly?

!cd tao_toolkit_recipes/tao_action_recognition/data_generation/ && bash ./preprocess_HMDB_RGB.sh $HOST_DATA_DIR/raw_data $HOST_DATA_DIR/processed_data

As you can see, all the output directories are still empty even after running the above command. The videos in raw_data/throw are fine, but they are not being preprocessed properly.

Thank you

To narrow this down, could you please open a terminal and run the shell script directly?
The script comes from tao_toolkit_recipes/preprocess_HMDB_RGB.sh at main · NVIDIA-AI-IOT/tao_toolkit_recipes · GitHub.
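If it still produces empty folders from a terminal, it is also worth confirming the script's dependencies are present on the VM; as far as I recall, the preprocessing scripts in that repo use ffmpeg to extract the frames. A rough, illustrative check (the paths are assumptions based on the notebook defaults):

# Rough sanity check on the VM. Assumes ffmpeg is the frame extractor used by
# the script and that the raw clips live under $HOST_DATA_DIR/raw_data/throw.
which ffmpeg || echo "ffmpeg not found - install it before rerunning the script"

# Extract frames from a single clip by hand to isolate the problem.
sample=$(ls "$HOST_DATA_DIR/raw_data/throw" | head -n 1)
mkdir -p /tmp/frame_test
ffmpeg -i "$HOST_DATA_DIR/raw_data/throw/$sample" /tmp/frame_test/%06d.png
ls /tmp/frame_test | wc -l

If ffmpeg is missing or cannot decode the clips, the output folders end up empty.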

I copied the code from GitHub into “preprocess_HMDB_RGB.sh” and ran it in a terminal, but it still does not process the videos into frames; the output is the same as from the notebook.

I checked that the videos in raw_data/throw are not corrupted.

Can you share the command you ran in the terminal?

OK, it works when I run it on my local machine rather than on my VM. Thanks a lot!


What is the error here? I set the config to run 18 epochs, but it seems to hit an error after finishing epoch 17.

Train RGB only model with PTM
2023-05-24 07:51:44,015 [INFO] root: Registry: ['nvcr.io']
2023-05-24 07:51:44,068 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-05-24 07:51:57 nemo_logging:349] <frozen cv.action_recognition.scripts.train>:81: UserWarning: 
    'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmpw74w6lw9
Writing /tmp/tmpw74w6lw9/_remote_module_non_scriptable.py
loading trained weights from /results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
ResNet3d(
  (conv1): Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)
  (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool3d(kernel_size=(1, 3, 3), stride=2, padding=(0, 1, 1), dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(64, 128, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(128, 256, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(128, 256, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(256, 512, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(256, 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avg_pool): AdaptiveAvgPool3d(output_size=(1, 1, 1))
  (fc_cls): Linear(in_features=512, out_features=1, bias=True)
)
[NeMo W 2023-05-24 07:52:02 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:151: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=False)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=False)`.
      rank_zero_deprecation(
    
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/rgb_3d_ptm/lightning_logs
Train dataset samples: 70
Validation dataset samples: 30
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Adjusting learning rate of group 0 to 1.0000e-03.

  | Name           | Type     | Params
--------------------------------------------
0 | model          | ResNet3d | 33.2 M
1 | train_accuracy | Accuracy | 0     
2 | val_accuracy   | Accuracy | 0     
--------------------------------------------
33.2 M    Trainable params
0         Non-trainable params
33.2 M    Total params
132.742   Total estimated model params size (MB)
[NeMo W 2023-05-24 07:52:05 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1938: PossibleUserWarning: The number of training samples (11) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
      rank_zero_warn(
    
Epoch 0:  83%|██████████████▏  | 10/12 [00:13<00:02,  1.40s/it, loss=0, v_num=0]Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 0:  92%|███████████████▌ | 11/12 [00:21<00:01,  1.91s/it, loss=0, v_num=0]
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 0: 100%|█| 12/12 [00:21<00:00,  1.80s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 1:  83%|▊| 10/12 [00:28<00:05,  2.88s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 1:  92%|▉| 11/12 [00:29<00:02,  2.65s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 1: 100%|█| 12/12 [00:29<00:00,  2.48s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 2:  83%|▊| 10/12 [00:36<00:07,  3.68s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 2:  92%|▉| 11/12 [00:37<00:03,  3.37s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 2: 100%|█| 12/12 [00:37<00:00,  3.15s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 3:  83%|▊| 10/12 [00:44<00:08,  4.49s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 3:  92%|▉| 11/12 [00:45<00:04,  4.10s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 3: 100%|█| 12/12 [00:45<00:00,  3.81s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 4:  83%|▊| 10/12 [00:53<00:10,  5.31s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 4:  92%|▉| 11/12 [00:53<00:04,  4.85s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 4: 100%|█| 12/12 [00:54<00:00,  4.50s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 5:  83%|▊| 10/12 [01:01<00:12,  6.11s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 5:  92%|▉| 11/12 [01:01<00:05,  5.58s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 5: 100%|█| 12/12 [01:02<00:00,  5.17s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 6:  83%|▊| 10/12 [01:09<00:13,  6.92s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 6:  92%|▉| 11/12 [01:09<00:06,  6.31s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 6: 100%|█| 12/12 [01:10<00:00,  5.84s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 7:  83%|▊| 10/12 [01:17<00:15,  7.73s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 7:  92%|▉| 11/12 [01:17<00:07,  7.05s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 7: 100%|█| 12/12 [01:18<00:00,  6.51s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 8:  83%|▊| 10/12 [01:25<00:17,  8.53s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 8:  92%|▉| 11/12 [01:25<00:07,  7.78s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 8: 100%|█| 12/12 [01:26<00:00,  7.19s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 9:  83%|▊| 10/12 [01:33<00:18,  9.33s/it, loss=0, v_num=0, val_loss=0.000,Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 9:  92%|▉| 11/12 [01:33<00:08,  8.51s/it, loss=0, v_num=0, val_loss=0.000,
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 9: 100%|█| 12/12 [01:34<00:00,  7.86s/it, loss=0, v_num=0, val_loss=0.000,
Epoch 10:  83%|▊| 10/12 [01:41<00:20, 10.14s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 10:  92%|▉| 11/12 [01:41<00:09,  9.24s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 10: 100%|█| 12/12 [01:42<00:00,  8.53s/it, loss=0, v_num=0, val_loss=0.000
Epoch 11:  83%|▊| 10/12 [01:49<00:21, 10.95s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 11:  92%|▉| 11/12 [01:49<00:09,  9.98s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 11: 100%|█| 12/12 [01:50<00:00,  9.20s/it, loss=0, v_num=0, val_loss=0.000
Epoch 12:  83%|▊| 10/12 [01:57<00:23, 11.75s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 12:  92%|▉| 11/12 [01:57<00:10, 10.71s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 12: 100%|█| 12/12 [01:58<00:00,  9.87s/it, loss=0, v_num=0, val_loss=0.000
Epoch 13:  83%|▊| 10/12 [02:05<00:25, 12.55s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-04.
Epoch 13:  92%|▉| 11/12 [02:05<00:11, 11.43s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 13: 100%|█| 12/12 [02:06<00:00, 10.53s/it, loss=0, v_num=0, val_loss=0.000
Epoch 14:  83%|▊| 10/12 [02:13<00:26, 13.35s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-05.
Epoch 14:  92%|▉| 11/12 [02:13<00:12, 12.16s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 14: 100%|█| 12/12 [02:14<00:00, 11.21s/it, loss=0, v_num=0, val_loss=0.000
Epoch 15:  83%|▊| 10/12 [02:21<00:28, 14.15s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-05.
Epoch 15:  92%|▉| 11/12 [02:21<00:12, 12.89s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 15: 100%|█| 12/12 [02:22<00:00, 11.87s/it, loss=0, v_num=0, val_loss=0.000
Epoch 16:  83%|▊| 10/12 [02:29<00:29, 14.95s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-05.
Epoch 16:  92%|▉| 11/12 [02:29<00:13, 13.62s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 16: 100%|█| 12/12 [02:30<00:00, 12.54s/it, loss=0, v_num=0, val_loss=0.000
Epoch 17:  83%|▊| 10/12 [02:37<00:31, 15.76s/it, loss=0, v_num=0, val_loss=0.000Adjusting learning rate of group 0 to 1.0000e-05.
Epoch 17:  92%|▉| 11/12 [02:37<00:14, 14.35s/it, loss=0, v_num=0, val_loss=0.000
Validation: 0it [00:00, ?it/s]
Validation DataLoader 0:   0%|                            | 0/1 [00:00<?, ?it/s]
Epoch 17: 100%|█| 12/12 [02:38<00:00, 13.21s/it, loss=0, v_num=0, val_loss=0.000
Epoch 17: 100%|█| 12/12 [02:41<00:00, 13.48s/it, loss=0, v_num=0, val_loss=0.000
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: PASS
2023-05-24 07:54:52,582 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

No error. Epoch numbering starts at 0, so epoch 17 is the 18th epoch; all 18 epochs completed.

Hi,

Can I check: if I were to create models for multiple actions, say ‘throw’ with 100 clips (split 70/30) and ‘snatch’ with 50 clips (split 35/15), can I train them together in one run of the notebook, or must I run one training for ‘throw’ and one for ‘snatch’, which would produce two exported rgb_resnet18_3.etlt files?

Yes, you can train both ‘throw’ and ‘snatch’ actions together in a single model; see the sketch below.
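A minimal sketch of what the combined dataset_config would look like, assuming you simply extend label_map and keep both classes' clip folders under the same train/ and test/ directories (class indices and paths are illustrative; in practice, edit dataset_config in your own train_rgb_3d_finetune.yaml). It is wrapped in a shell heredoc only so it can be pasted next to the other commands:

# Sketch only: reference fragment, not a complete spec.
cat <<'EOF'
dataset_config:
  train_dataset_dir: /data/train   # contains throw/ and snatch/ clip folders
  val_dataset_dir: /data/test
  label_map:
    throw: 0
    snatch: 1
EOF

One training run over such a dataset then gives you a single model, and a single exported .etlt, covering both classes.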

I am facing an error when trying to export the RGB model.

# Export the RGB model to encrypted ONNX model
!tao action_recognition export \
                   -e $SPECS_DIR/export_rgb.yaml \
                   -k $KEY \
                   model=$RESULTS_DIR/rgb_3d_ptm/rgb_only_model.tlt \
                   output_file=$RESULTS_DIR/export/rgb_resnet18_3.etlt

Error:

2023-05-29 09:39:15,346 [INFO] root: Registry: ['nvcr.io']
2023-05-29 09:39:15,400 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-05-29 09:39:28 nemo_logging:349] <frozen cv.action_recognition.scripts.export>:155: UserWarning: 
    'export_rgb.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp3sbl48t1
Writing /tmp/tmp3sbl48t1/_remote_module_non_scriptable.py
ResNet3d(
  (conv1): Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)
  (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool3d(kernel_size=(1, 3, 3), stride=2, padding=(0, 1, 1), dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock3d(
...
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-05-29 09:39:37,728 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This error can be ignored (it is just the telemetry upload failing). Please check whether rgb_resnet18_3.etlt was generated; if it is there, the export was successful. For example:
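From a terminal, or with a leading ! in the notebook, assuming $HOST_RESULTS_DIR is the host folder mapped to /results inside the container:

# List whatever the export step wrote out.
ls -lh $HOST_RESULTS_DIR/export/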

The file is not in the export directory.

The log above omits something.
Could you please share the full log? Thanks a lot.

Here is the full log:

2023-05-31 04:36:45,995 [INFO] root: Registry: ['nvcr.io']
2023-05-31 04:36:46,048 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-05-31 04:36:59 nemo_logging:349] <frozen cv.action_recognition.scripts.export>:155: UserWarning: 
    'export_rgb.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp87lljxfr
Writing /tmp/tmp87lljxfr/_remote_module_non_scriptable.py
ResNet3d(
  (conv1): Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)
  (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool3d(kernel_size=(1, 3, 3), stride=2, padding=(0, 1, 1), dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(64, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(64, 128, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(128, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(128, 256, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(128, 256, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock3d(
      (conv1): Conv3d(256, 512, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv3d(256, 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
        (1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock3d(
      (conv1): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn1): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1), bias=False)
      (bn2): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avg_pool): AdaptiveAvgPool3d(output_size=(1, 1, 1))
  (fc_cls): Linear(in_features=512, out_features=2, bias=True)
)
Error executing job with overrides: ['encryption_key=nvidia_tao', 'model=/results/rgb_3d_ptm/rgb_only_model.tlt', 'output_file=/results/export/rgb_resnet18_3.etlt']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/export.py>", line 3, in <module>
  File "<frozen cv.action_recognition.scripts.export>", line 155, in <module>
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.action_recognition.scripts.export>", line 33, in main
  File "<frozen cv.action_recognition.scripts.export>", line 73, in run_export
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 209, in _load_model_state
    keys = model.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1660, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ActionRecognitionModel:
	size mismatch for model.fc_cls.weight: copying a param with shape torch.Size([9, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
	size mismatch for model.fc_cls.bias: copying a param with shape torch.Size([9]) from checkpoint, the shape in current model is torch.Size([2]).
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-05-31 04:37:08,536 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Issue resolved! Thank you.

Could you share how you fixed the issue? Thanks.

My ‘export_rgb.yaml’ was not configured with the correct number of classes for my custom action recognition model.
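For anyone who hits the same size mismatch: a rough way to spot it is to compare the class entries in the two specs, assuming your export_rgb.yaml carries the same dataset_config / label_map section as the training spec (the spec file names and the $HOST_SPECS_DIR variable come from the sample notebook; adjust to your setup):

# Rough consistency check: the number of label_map entries (and hence the size
# of the fc_cls head) should match between the training spec, the export spec,
# and the trained .tlt checkpoint being exported.
grep -A 10 "label_map" $HOST_SPECS_DIR/train_rgb_3d_finetune.yaml $HOST_SPECS_DIR/export_rgb.yaml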