Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
V100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
ActionRecognitionNet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021
• Training spec file (If you have one, please share it here)
output_dir: /results/rgb_3d_ptm
encryption_key: nvidia_tao
model_config:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 3
  input_type: 3d
  sample_strategy: consecutive
  dropout_ratio: 0.0
train_config:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  epochs: 60
  checkpoint_interval: 5
dataset_config:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    gaming: 0
    sleeping: 1
    other: 2
    # fall_floor: 0
    # ride_bike: 1
  output_shape:
  - 224
  - 224
  batch_size: 32
  workers: 8
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False
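In case it matters: the data under train_dataset_dir / val_dataset_dir is organized the way I understand the TAO action recognition docs to expect it, one folder per class, one folder per clip, with the extracted frames under rgb/. A minimal sketch of that layout (the clip and frame names below are placeholders, not my actual files):

/data/train/
├── gaming/
│   ├── clip_0001/
│   │   └── rgb/
│   │       ├── 000001.png
│   │       ├── 000002.png
│   │       └── ...
│   └── clip_0002/
├── sleeping/
└── other/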
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:446: UserWarning: Checkpoint directory /results/rgb_3d_ptm exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:487: LightningDeprecationWarning: Argument `period` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5. Please use `every_n_epochs` instead.
  rank_zero_deprecation(
Train dataset samples: 30
Validation dataset samples: 30
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Adjusting learning rate of group 0 to 1.0000e-03.
  | Name           | Type     | Params
--------------------------------------------
0 | model          | ResNet3d | 33.2 M
1 | train_accuracy | Accuracy | 0
2 | val_accuracy   | Accuracy | 0
--------------------------------------------
33.2 M Trainable params
0 Non-trainable params
33.2 M Total params
132.747 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/1 [00:00<?, ?it/s]
/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/frame_sampler.py:61: RuntimeWarning: divide by zero encountered in remainder
Error executing job with overrides: ['output_dir=/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=5']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 70, in main
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 59, in run_experiment
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1115, in _run_sanity_check
    self._evaluation_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 93, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/ar_dataset.py", line 239, in __getitem__
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/dataloader/ar_dataset.py", line 200, in get_frames
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 76, in <module>
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/super_resolution/scripts/configs/hydra_runner.py", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
2022-01-21 14:41:41,769 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
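The RuntimeWarning from frame_sampler.py ("divide by zero encountered in remainder") fires immediately before the crash, so my guess is that at least one clip resolves to zero usable frames, which would also explain the IndexError in get_frames. Below is a quick way to check for that; it is a minimal sketch of my own (the script, the frame-count threshold, and the assumed <class>/<clip>/rgb/ layout are mine, not part of TAO):

import os

RGB_SEQ_LENGTH = 3  # matches rgb_seq_length in the spec above

def check_split(root):
    # Walk <root>/<action>/<clip>/rgb/ and flag clips the sampler cannot use.
    for action in sorted(os.listdir(root)):
        action_dir = os.path.join(root, action)
        if not os.path.isdir(action_dir):
            continue
        for clip in sorted(os.listdir(action_dir)):
            rgb_dir = os.path.join(action_dir, clip, "rgb")
            if not os.path.isdir(rgb_dir):
                print(f"no rgb/ folder: {os.path.join(action_dir, clip)}")
                continue
            frames = [f for f in os.listdir(rgb_dir)
                      if f.lower().endswith((".png", ".jpg", ".jpeg"))]
            if len(frames) < RGB_SEQ_LENGTH:
                print(f"{rgb_dir}: only {len(frames)} frame(s)")

for split in ("/data/train", "/data/test"):
    check_split(split)

If that does turn up empty or too-short clip folders, is re-extracting the frames the right fix, or is there a documented minimum frame count per clip for the dataloader?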