Please provide the following information when requesting support.
• Hardware (NVIDIA RTX 3080Ti)
• Network Type (DINO)
• TLT Version (Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
)
• Training spec file (
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/sample_new/data/val2017/
      json_file: /workspace/tao-experiments/sample_new/data/instances_val2017.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/sample_new/data/val2017/
      json_file: /workspace/tao-experiments/sample_new/data/instances_val2017.json
  num_classes: 91
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /workspace/tao-experiments/project_dino/gcvit_xxtiny_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
)
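
Note: the spec above sets backbone: fan_small while pretrained_backbone_path points at a gcvit_xxtiny checkpoint, and the log below reports loading fan_small_hybrid_nvimagenet.pth followed by an _IncompatibleKeys warning. A minimal sketch, assuming a standard PyTorch checkpoint file, for inspecting which parameter keys a backbone file actually contains (the path and the unwrapping logic are illustrative, not TAO-specific):

import torch

# Path taken from pretrained_backbone_path in the spec above; adjust to the file actually used.
ckpt_path = "/workspace/tao-experiments/project_dino/gcvit_xxtiny_nvimagenet.pth"

ckpt = torch.load(ckpt_path, map_location="cpu")
# Some checkpoints wrap the weights under "state_dict" or "model"; unwrap if present.
state_dict = ckpt.get("state_dict", ckpt.get("model", ckpt)) if isinstance(ckpt, dict) else ckpt

# Print the top-level key prefixes, to compare against what the fan_small backbone expects.
print(sorted({key.split(".")[0] for key in state_dict}))
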
• How to reproduce the issue? (
$ dino train -e /tao-pt/nvidia_tao_pytorch/cv/dino/experiment_specs/train_old.yaml
tao-pt# dino train -e /tao-pt/nvidia_tao_pytorch/cv/dino/experiment_specs/train_old.yaml
sys:1: UserWarning:
'train_old.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'train_old.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /tao-pt/nvidia_tao_pytorch/cv/dino/results/train/status.json
rank_zero_warn(
Seed set to 1234
Train results will be saved at: /tao-pt/nvidia_tao_pytorch/cv/dino/results/train
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Loaded pretrained weights from /tao-pt/nvidia_tao_pytorch/cv/dino/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /tao-pt/nvidia_tao_pytorch/cv/dino/results/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "
| Name | Type | Params
0 | model | DINOModel | 48.2 M
1 | matcher | HungarianMatcher | 0
2 | criterion | SetCriterion | 0
3 | box_processors | PostProcess | 0
48.2 M Trainable params
0 Non-trainable params
48.2 M Total params
192.674 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s]
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
    fd, size = storage.share_fd_cpu()
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 304, in wrapper
    return fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 374, in share_fd_cpu
    return super().share_fd_cpu(*args, **kwargs)
RuntimeError: unable to write to file </torch_776_1293102621_0>: No space left on device (28)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Error executing job with overrides:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 132, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1059, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 114, in run
    self.on_run_start()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 245, in on_run_start
    self._on_evaluation_epoch_start()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 326, in _on_evaluation_epoch_start
    call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 232, in on_validation_epoch_start
    tmp = json.load(f)
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 656) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2025-01-22 07:05:22,962 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-22 07:05:22,962 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-22 07:05:22,962 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'dino', 'gpu': ['NVIDIA-GeForce-RTX-3080-Ti-Laptop-GPU'], 'success': False, 'time_lapsed': 6} to https://api.tao.ngc.nvidia.com.
[2025-01-22 07:05:24,821 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-22 07:05:24,823 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-22 07:05:24,823 - TAO Toolkit - root - WARNING] Execution status: FAIL)
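
The repeated bus errors and the "No space left on device (28)" failure above point at the DataLoader workers exhausting the container's shared-memory mount (/dev/shm). A minimal check, run from inside the same container (assuming a standard Linux /dev/shm tmpfs), to see how much shared memory is actually available:

import os

# Report the size of the shared-memory mount used by the DataLoader workers.
stats = os.statvfs("/dev/shm")
total_gib = stats.f_blocks * stats.f_frsize / 1024**3
free_gib = stats.f_bavail * stats.f_frsize / 1024**3
print(f"/dev/shm total: {total_gib:.2f} GiB, free: {free_gib:.2f} GiB")

Docker containers default to a 64 MiB /dev/shm, which is easily exhausted by 8 dataloader workers; starting the container with a larger --shm-size (or --ipc=host), or lowering dataset.workers in the spec, are the usual workarounds.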