Error while training ActionRecognitionNet with TAO

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
v100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
ActionRecognitionNet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file(If have, please share here)

output_dir: /results/rgb_3d_ptm
encryption_key: nvidia_tao
model_config:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 3
  input_type: 3d
  sample_strategy: consecutive
  dropout_ratio: 0.0
train_config:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  epochs: 60
  checkpoint_interval: 5
dataset_config:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    gaming: 0
    sleeping: 1
    other: 2
    #fall_floor: 0
    #ride_bike: 1
  output_shape:
  - 224
  - 224
  batch_size: 32
  workers: 8
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)


TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:446: UserWarning: Checkpoint directory /results/rgb_3d_ptm exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:487: LightningDeprecationWarning: Argument `period` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5. Please use `every_n_epochs` instead.
  rank_zero_deprecation(
Train dataset samples: 30
Validation dataset samples: 30
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Adjusting learning rate of group 0 to 1.0000e-03.

  | Name           | Type     | Params
--------------------------------------------
0 | model          | ResNet3d | 33.2 M
1 | train_accuracy | Accuracy | 0     
2 | val_accuracy   | Accuracy | 0     
--------------------------------------------
33.2 M    Trainable params
0         Non-trainable params
33.2 M    Total params
132.747   Total estimated model params size (MB)
Validation sanity check:   0%|                            | 0/1 [00:00<?, ?it/s]/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/frame_sampler.py:61: RuntimeWarning: divide by zero encountered in remainder
Error executing job with overrides: ['output_dir=/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=5']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 70, in main
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 59, in run_experiment
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1115, in _run_sanity_check
    self._evaluation_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 93, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/ar_dataset.py", line 239, in __getitem__
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/ar_dataset.py", line 200, in get_frames
IndexError: list index out of range


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 76, in <module>
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/super_resolution/scripts/configs/hydra_runner.py", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
2022-01-21 14:41:41,769 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

May I know if you follow jupyter notebook without any change?

Hi, Morganh. I use a custom dataset for retraining.I made a custom dataset referring to the format of the HMDB51 dataset.

Thanks for the info. Is it successful when you run jupyter notebook?

Success until the “RUN TAO training” step.

May I know if you can run default jupyter notebook successfully with HMDB51 dataset?

I trained successfully with HMDB51 dataset, but got error with custom dataset.

Could you "ll -sh " the custom dataset?

Could you check if training dataset has 30 samples and Validation dataset has 30 samples ?

Thank you. I don’t have 30samples for both training and validation. But I didn’t find the set number of samples in “train_rgb_3d_finetune.yaml”

How about checking this?
More, is there a dir with no images?

More, may I know if you refer to GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC to prepare your own dataset?

Thanks for your support, the reason for this error is that jupyter lab generates invisible folders.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.