Encountering "Bus error (core dumped)" in NVIDIA Docker Container

Hello NVIDIA Community,

I am reaching out for assistance with a persistent issue I am facing while training deep learning models in an NVIDIA Docker container. Despite several troubleshooting attempts, I have been unable to resolve it.

Environment Details:

  - Container: NVIDIA Docker container with Python 3.8.12 and torch 1.12.0a0+2c916ef (as reported in the log below)
  - GPUs: NVIDIA GeForce RTX 2080 Ti; training uses more than one GPU by default
  - Framework: YOLOv5 v6.1-243-gbf6d52c, experiment tracking with wandb 0.16.1

Issue Description:
I encounter a "Bus error (core dumped)" when attempting to train models with YOLOv5 (https://github.com/ultralytics/yolov5.git). The error also occurs with a minimal PyTorch script, which suggests the problem is not specific to YOLOv5.
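
For reference, the minimal check was along these lines (a reconstructed sketch rather than the exact script; the toy model and tensor sizes are placeholders, but it exercises the same multi-GPU DataParallel path that YOLOv5 falls back to):

```python
# Sketch of a minimal multi-GPU PyTorch check (placeholder model and sizes):
# train a tiny network on random data across all visible GPUs via DataParallel.
import torch
import torch.nn as nn

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(),
        nn.Linear(16 * 64 * 64, 10),
    )
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # same DP fallback the YOLOv5 log warns about
    model = model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        x = torch.randn(32, 3, 64, 64, device=device)   # random "images"
        y = torch.randint(0, 10, (32,), device=device)  # random labels
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        print(f"step {step}: loss={loss.item():.4f}")

if __name__ == "__main__":
    main()
```

A script of roughly this shape also ends in the same bus error on this machine when more than one GPU is visible.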

Steps Already Taken:

  1. Reduced the batch size and the number of dataloader workers.
  2. Set CUDA_VISIBLE_DEVICES so that only a single GPU is visible; with one GPU, training proceeds without any errors (a rough illustration follows this list).
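
The single-GPU run from step 2 was set up roughly as follows (the device index 0 is only an example):

```python
# Rough illustration of step 2: make only one GPU visible before CUDA is
# initialized, then run the usual training code. With this restriction in
# place the bus error does not occur.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example index; set before importing torch

import torch
print(torch.cuda.device_count())  # reports 1 once the restriction is active
```

The same effect can also be had by exporting CUDA_VISIBLE_DEVICES in the shell before launching train.py.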

System Log:
python ./train.py --data ./data/coco_person.yaml --cfg ./models/yolov5s_person.yaml --weights ./weights/yolov5s.pt --batch-size 32 --epochs 120 --workers 0 --name s_120 --project yolo_person_s11
wandb: Currently logged in as: yeyuhaosteve. Use wandb login --relogin to force relogin
train: weights=./weights/yolov5s.pt, cfg=./models/yolov5s_person.yaml, data=./data/coco_person.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=120, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=0, project=yolo_person_s11, name=s_120, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.1-243-gbf6d52c Python-3.8.12 torch-1.12.0a0+2c916ef CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 22189MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir yolo_person_s11', view at http://localhost:6006/
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /app/yolov5/wandb/run-20231220_072107-pso3pqgb
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run s_120
wandb: ⭐️ View project at Weights & Biases
wandb: 🚀 View run at Weights & Biases
YOLOv5 temporarily requires wandb version 0.12.10 or below. Some features may not work as expected.

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     26970  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s_person summary: 270 layers, 7033114 parameters, 7033114 gradients, 16.0 GFLOPs

Transferred 342/349 items from weights/yolov5s.pt
Fusing layers…
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
AMP: checks failed ❌, disabling Automatic Mixed Precision. See https://github.com/ultralytics/yolov5/issues/7908
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
WARNING: DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
train: Scanning '/app/datasets/person_data/labels/train' images and labels…8000 found, 0 missing, 0 empty, 0 corrupt: 100%|██████████| 8000/8000 [0
train: New cache created: /app/datasets/person_data/labels/train.cache
val: Scanning '/app/datasets/person_data/labels/val.cache' images and labels… 1000 found, 0 missing, 0 empty, 0 corrupt: 100%|██████████| 1000/1000
Plotting labels to yolo_person_s11/s_120/labels.jpg…

AutoAnchor: 4.93 anchors/target, 0.999 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to yolo_person_s11/s_120
Starting training for 120 epochs…

 Epoch   gpu_mem       box       obj       cls    labels  img_size

0%| | 0/250 [00:00<?, ?it/s] [1703056890.511690] [072de2215b05:811 :0] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1703056890.511736] [072de2215b05:811 :1] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1703056890.511767] [072de2215b05:811 :2] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 1 was not set in ucs
[072de2215b05:811 :1:1033] Caught signal 7 (Bus error: nonexistent physical address)
[1703056890.511790] [072de2215b05:811 :0] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 1 was not set in ucs
[072de2215b05:811 :2:1032] Caught signal 7 (Bus error: nonexistent physical address)
[1703056890.511799] [072de2215b05:811 :2] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1703056890.511793] [072de2215b05:811 :1] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[072de2215b05:811 :0:1034] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

Despite these measures, the error persists whenever more than one GPU is involved. I suspect it may be related to the configuration of the Docker container or a deeper system-level issue.
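
If it helps with diagnosis: one container-level setting I understand can surface as SIGBUS is an undersized /dev/shm (Docker's default is only 64 MB unless --shm-size or --ipc=host is used). A minimal sketch of how I would check it from inside the container:

```python
# Diagnostic sketch (assumes the standard /dev/shm tmpfs mount inside the
# container): report how much shared memory the container actually has,
# since exhausting a small /dev/shm is one known way to hit a bus error.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, "
      f"used={used / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
```

I can post the output of this check along with my docker run options if that would be useful.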

I would greatly appreciate any guidance or suggestions on how to resolve this error. Thank you in advance for your time and assistance.

Best regards,