• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file(If have, please share here)

output_dir: /results/rgb_3d_ptm
encryption_key: nvidia_tao
model_config:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 3
  input_type: 3d
  sample_strategy: consecutive
  dropout_ratio: 0.0
train_config:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  epochs: 60
  checkpoint_interval: 5
dataset_config:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    gaming: 0
    sleeping: 1
    other: 2
    #fall_floor: 0
    #ride_bike: 1
  output_shape:
  - 224
  - 224
  batch_size: 32
  workers: 8
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/ UserWarning: Checkpoint directory /results/rgb_3d_ptm exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/ LightningDeprecationWarning: Argument `period` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5. Please use `every_n_epochs` instead.
Train dataset samples: 30
Validation dataset samples: 30
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Adjusting learning rate of group 0 to 1.0000e-03.

  | Name           | Type     | Params
0 | model          | ResNet3d | 33.2 M
1 | train_accuracy | Accuracy | 0     
2 | val_accuracy   | Accuracy | 0     
33.2 M    Trainable params
0         Non-trainable params
33.2 M    Total params
132.747   Total estimated model params size (MB)
Validation sanity check:   0%|                            | 0/1 [00:00<?, ?it/s]/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/ RuntimeWarning: divide by zero encountered in remainder
Error executing job with overrides: ['output_dir=/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=5']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/", line 368, in <lambda>
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/", line 110, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/", line 70, in main
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/", line 59, in run_experiment
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 553, in fit
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 918, in _run
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 986, in _dispatch
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/", line 92, in start_training
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/", line 161, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 996, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1031, in _run_train
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1115, in _run_sanity_check
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/", line 110, in advance
    dl_outputs =
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/", line 93, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/", line 1229, in _process_data
  File "/opt/conda/lib/python3.8/site-packages/torch/", line 434, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/", line 239, in __getitem__
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/", line 200, in get_frames
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/", line 76, in <module>
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/super_resolution/scripts/configs/", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/", line 367, in _run_hydra
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/", line 251, in run_and_report
    assert mdl is not None
2022-01-21 14:41:41,769 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
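
One plausible reading of the log above: the loader was handed a sample directory that contains no frames. The sketch below is not the TAO dataloader; `sample_frames` is a hypothetical stand-in that wraps frame indices around the clip length. It only shows that an empty frame list yields the same pair of messages (the NumPy "divide by zero encountered in remainder" warning, then "IndexError: list index out of range").

```python
import numpy as np


def sample_frames(frame_paths, seq_length=3):
    """Hypothetical clip sampler (an assumption, not the TAO implementation)."""
    num_frames = len(frame_paths)
    # Wrap consecutive indices around the clip length. With num_frames == 0,
    # NumPy emits a RuntimeWarning about divide by zero in remainder
    # and returns zeros instead of raising.
    indices = np.remainder(np.arange(seq_length), num_frames)
    # Indexing an empty list with any index then raises IndexError.
    return [frame_paths[int(i)] for i in indices]


if __name__ == "__main__":
    print(sample_frames(["0.png", "1.png"]))  # a short clip wraps around
    try:
        sample_frames([])  # an empty sample directory
    except IndexError as err:
        print("IndexError:", err)
```

If this reading is right, the thing to look for is a directory that the loader counts as a sample but that holds no images.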

May I know if you followed the Jupyter notebook without any changes?

Hi, Morganh. I use a custom dataset for retraining. I built it following the format of the HMDB51 dataset.

Thanks for the info. Does the Jupyter notebook run successfully?

It runs successfully up to the “RUN TAO training” step.

May I know if you can run the default Jupyter notebook successfully with the HMDB51 dataset?

I trained successfully with the HMDB51 dataset, but got the error with my custom dataset.

Could you run "ll -sh" on the custom dataset and share the output?

Could you check whether the training dataset has 30 samples and the validation dataset has 30 samples?

Thank you. I don’t have 30 samples in either the training or the validation set. But I couldn’t find where the number of samples is set in “train_rgb_3d_finetune.yaml”.
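
The spec file shown earlier has no sample-count field, so the "Train dataset samples: 30" line in the log is presumably derived from what the loader finds under train_dataset_dir. A minimal sketch for comparing that number with the directory contents, assuming an HMDB51-style layout with one sub-directory per video under each class folder. The helper name `count_samples` is made up for this example, and it deliberately counts every sub-directory, hidden ones included, so that a mismatch shows up.

```python
from pathlib import Path


def count_samples(dataset_dir, label_map):
    """Count sub-directories (assumed: one per video) under each class folder."""
    root = Path(dataset_dir)
    counts = {}
    for class_name in label_map:
        class_dir = root / class_name
        if class_dir.is_dir():
            # Hidden directories are counted too, on purpose.
            counts[class_name] = sum(1 for p in class_dir.iterdir() if p.is_dir())
        else:
            counts[class_name] = 0
    return counts


if __name__ == "__main__":
    # Paths and labels taken from the spec file in this thread.
    labels = {"gaming": 0, "sleeping": 1, "other": 2}
    for split in ("/data/train", "/data/test"):
        per_class = count_samples(split, labels)
        print(split, per_class, "total:", sum(per_class.values()))
```

If the total is larger than the number of videos you prepared, some of the counted directories are not real samples.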

How about checking this?
Also, is there a directory with no images?
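
One way to answer that question without inspecting every folder by hand is sketched below. It lists leaf directories that hold no image files. The set of image extensions and the helper name `dirs_without_images` are assumptions for illustration; adjust them to match how the frames were extracted.

```python
from pathlib import Path

# Assumed frame formats; extend if the dataset uses others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def dirs_without_images(dataset_dir):
    """Return leaf directories under dataset_dir that contain no image files."""
    empty = []
    # Path.rglob also matches dot-directories, so hidden folders are included.
    for d in sorted(p for p in Path(dataset_dir).rglob("*") if p.is_dir()):
        children = list(d.iterdir())
        if any(c.is_dir() for c in children):
            continue  # not a leaf; its sub-directories are checked on their own
        if not any(c.suffix.lower() in IMAGE_SUFFIXES for c in children):
            empty.append(d)
    return empty


if __name__ == "__main__":
    for path in dirs_without_images("/data/train"):
        print("no images in:", path)
```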

Also, may I know if you referred to the NGC page (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC) to prepare your own dataset?

Thanks for your support. The error was caused by invisible (hidden) folders that JupyterLab generates.
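
For anyone hitting the same error, here is a minimal sketch for locating such hidden folders inside a dataset directory. The thread does not name the folders; `.ipynb_checkpoints` is the usual one Jupyter creates, so treating any dot-prefixed directory as suspect is an assumption. The helper names are made up for this example, and removal is a dry run by default.

```python
import os
import shutil


def find_hidden_dirs(dataset_dir):
    """Return paths of dot-prefixed directories anywhere under dataset_dir."""
    hidden = []
    for current, dirnames, _ in os.walk(dataset_dir):
        for name in list(dirnames):
            if name.startswith("."):
                hidden.append(os.path.join(current, name))
                dirnames.remove(name)  # do not descend into the hidden folder
    return sorted(hidden)


def remove_hidden_dirs(dataset_dir, dry_run=True):
    """List hidden directories; delete them only when dry_run is False."""
    targets = find_hidden_dirs(dataset_dir)
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return targets


if __name__ == "__main__":
    # Paths taken from the spec file in this thread.
    for split in ("/data/train", "/data/test"):
        for path in remove_hidden_dirs(split, dry_run=True):
            print("hidden directory:", path)
```

Review the printed list first, then call remove_hidden_dirs with dry_run=False once you are sure nothing in it is needed.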

