ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

Please provide the following information when requesting support.

• Hardware (NVIDIA RTX 3080Ti)
• Network Type (DINO)
• TLT Version (Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
)
• Training spec file(
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/sample_new/data/val2017/
      json_file: /workspace/tao-experiments/sample_new/data/instances_val2017.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/sample_new/data/val2017/
      json_file: /workspace/tao-experiments/sample_new/data/instances_val2017.json
  num_classes: 91
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /workspace/tao-experiments/project_dino/gcvit_xxtiny_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
)
• How to reproduce the issue? (

tao-pt# dino train -e /tao-pt/nvidia_tao_pytorch/cv/dino/experiment_specs/train_old.yaml
sys:1: UserWarning:
'train_old.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'train_old.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /tao-pt/nvidia_tao_pytorch/cv/dino/results/train/status.json
rank_zero_warn(
Seed set to 1234
Train results will be saved at: /tao-pt/nvidia_tao_pytorch/cv/dino/results/train
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Loaded pretrained weights from /tao-pt/nvidia_tao_pytorch/cv/dino/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /tao-pt/nvidia_tao_pytorch/cv/dino/results/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

  | Name           | Type             | Params
------------------------------------------------
0 | model          | DINOModel        | 48.2 M
1 | matcher        | HungarianMatcher | 0
2 | criterion      | SetCriterion     | 0
3 | box_processors | PostProcess      | 0
------------------------------------------------
48.2 M    Trainable params
0         Non-trainable params
48.2 M    Total params
192.674   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
fd, size = storage.share_fd_cpu()
File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 304, in wrapper
return fn(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 374, in share_fd_cpu
return super().share_fd_cpu(*args, **kwargs)
RuntimeError: unable to write to file </torch_776_1293102621_0>: No space left on device (28)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Error executing job with overrides:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 132, in run_experiment
trainer.fit(pt_model, dm, ckpt_path=resume_ckpt)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
self._run_sanity_check()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1059, in _run_sanity_check
val_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 114, in run
self.on_run_start()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 245, in on_run_start
self._on_evaluation_epoch_start()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 326, in _on_evaluation_epoch_start
call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 232, in on_validation_epoch_start
tmp = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 656) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

[2025-01-22 07:05:22,962 - TAO Toolkit - root - INFO] Sending telemetry data.

[2025-01-22 07:05:22,962 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-22 07:05:22,962 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'dino', 'gpu': ['NVIDIA-GeForce-RTX-3080-Ti-Laptop-GPU'], 'success': False, 'time_lapsed': 6} to https://api.tao.ngc.nvidia.com.
[2025-01-22 07:05:24,821 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-22 07:05:24,823 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-22 07:05:24,823 - TAO Toolkit - root - WARNING] Execution status: FAIL)

This should be related to the container running out of shared memory (/dev/shm). Please set a larger --shm-size in the docker run command line and retry, for example --shm-size 100G.
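As a quick sanity check (standard Linux tooling, not TAO-specific), you can confirm the current shared-memory limit from inside the container; with Docker's default it typically reports 64M:

# Inside the container: show the size of the shared-memory mount
df -h /dev/shm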

~/tao_pytorch_backend-main$ docker run --runtime=nvidia -it --rm -v /home/quest/tao_pytorch_backend-main:/tao-pt nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash --shm-size 100G

===========================
=== TAO Toolkit PyTorch ===

NVIDIA Release 5.5.0-PyT (build 88113656)
TAO Toolkit Version 5.5.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 535.183.01 which has support for CUDA 12.2. This container
was built with CUDA 12.4 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See the "Why CUDA Compatibility" section of the CUDA Compatibility r555 documentation for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 …

/bin/bash: --shm-size: invalid option
Usage: /bin/bash [GNU long option] [option] …
/bin/bash [GNU long option] [option] script-file …
GNU long options:
--debug
--debugger
--dump-po-strings
--dump-strings
--help
--init-file
--login
--noediting
--noprofile
--norc
--posix
--pretty-print
--rcfile
--restricted
--verbose
--version
Shell options:
-ilrsD or -c command or -O shopt_option (invocation only)
-abefhkmnptuvxBCHP or -o option

The --shm-size flag is an option for docker run, not for /bin/bash, so it has to come before the image name; in your command it was passed to bash, which is why bash reported "invalid option". Please move it a little earlier:

docker run --runtime=nvidia -it --rm --shm-size 100G -v /home/quest/tao_pytorch_backend-main:/tao-pt nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash
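Alternatively, following the flags recommended in the container startup banner above, sharing the host's IPC namespace removes the fixed /dev/shm cap altogether; a sketch of that variant, reusing the same image and mount from your command:

# Alternative: share the host's IPC namespace (and its /dev/shm) instead of setting a fixed --shm-size
docker run --runtime=nvidia -it --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /home/quest/tao_pytorch_backend-main:/tao-pt \
    nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash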

More info can be found in the Stack Overflow thread "How to increase the size of the /dev/shm in docker container".

Thank you.