Encountering "Bus error (core dumped)" in NVIDIA Docker Container

Hello NVIDIA Community,

I am reaching out for assistance with a persistent issue I am facing while training deep learning models in an NVIDIA Docker container. Despite several troubleshooting attempts, I have been unable to resolve it.

Environment Details:

  - Container: NVIDIA Docker container with Python 3.8.12 and torch 1.12.0a0+2c916ef (as reported in the log below)
  - GPUs: NVIDIA GeForce RTX 2080 Ti; training uses more than one GPU by default
  - Framework: YOLOv5 v6.1-243-gbf6d52c, experiment tracking with wandb 0.16.1

Issue Description:
I encounter a "Bus error (core dumped)" when attempting to train models with YOLOv5 (https://github.com/ultralytics/yolov5.git). The error also occurs with a minimal PyTorch script, which suggests the problem is not specific to YOLOv5.
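
For reference, the minimal check was along these lines (a reconstructed sketch rather than the exact script; the toy model and tensor sizes are placeholders, but it exercises the same multi-GPU DataParallel path that YOLOv5 falls back to):

```python
# Sketch of a minimal multi-GPU PyTorch check (placeholder model and sizes):
# train a tiny network on random data across all visible GPUs via DataParallel.
import torch
import torch.nn as nn

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(),
        nn.Linear(16 * 64 * 64, 10),
    )
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # same DP fallback the YOLOv5 log warns about
    model = model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        x = torch.randn(32, 3, 64, 64, device=device)   # random "images"
        y = torch.randint(0, 10, (32,), device=device)  # random labels
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        print(f"step {step}: loss={loss.item():.4f}")

if __name__ == "__main__":
    main()
```

A script of roughly this shape also ends in the same bus error on this machine when more than one GPU is visible.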

Steps Already Taken:

  1. Reduced the batch size and the number of dataloader workers.
  2. Set CUDA_VISIBLE_DEVICES so that only a single GPU is visible; with one GPU, training proceeds without any errors (a rough illustration follows this list).
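
The single-GPU run from step 2 was set up roughly as follows (the device index 0 is only an example):

```python
# Rough illustration of step 2: make only one GPU visible before CUDA is
# initialized, then run the usual training code. With this restriction in
# place the bus error does not occur.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example index; set before importing torch

import torch
print(torch.cuda.device_count())  # reports 1 once the restriction is active
```

The same effect can also be had by exporting CUDA_VISIBLE_DEVICES in the shell before launching train.py.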

System Log:
python ./train.py --data ./data/coco_person.yaml --cfg ./models/yolov5s_person.yaml --weights ./weights/yolov5s.pt --batch-size 32 --epochs 120 --workers 0 --name s_120 --project yolo_person_s11
wandb: Currently logged in as: yeyuhaosteve. Use wandb login --relogin to force relogin
train: weights=./weights/yolov5s.pt, cfg=./models/yolov5s_person.yaml, data=./data/coco_person.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=120, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=0, project=yolo_person_s11, name=s_120, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.1-243-gbf6d52c Python-3.8.12 torch-1.12.0a0+2c916ef CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 22189MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir yolo_person_s11', view at http://localhost:6006/
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /app/yolov5/wandb/run-20231220_072107-pso3pqgb
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run s_120
wandb: ⭐️ View project at Weights & Biases
wandb: 🚀 View run at Weights & Biases
YOLOv5 temporarily requires wandb version 0.12.10 or below. Some features may not work as expected.

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     26970  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s_person summary: 270 layers, 7033114 parameters, 7033114 gradients, 16.0 GFLOPs

Transferred 342/349 items from weights/yolov5s.pt
Fusing layers…
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
AMP: checks failed ❌, disabling Automatic Mixed Precision. See https://github.com/ultralytics/yolov5/issues/7908
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
WARNING: DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
train: Scanning '/app/datasets/person_data/labels/train' images and labels…8000 found, 0 missing, 0 empty, 0 corrupt: 100%|██████████| 8000/8000 [0
train: New cache created: /app/datasets/person_data/labels/train.cache
val: Scanning '/app/datasets/person_data/labels/val.cache' images and labels… 1000 found, 0 missing, 0 empty, 0 corrupt: 100%|██████████| 1000/1000
Plotting labels to yolo_person_s11/s_120/labels.jpg…

AutoAnchor: 4.93 anchors/target, 0.999 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to yolo_person_s11/s_120
Starting training for 120 epochs…

 Epoch   gpu_mem       box       obj       cls    labels  img_size

0%| | 0/250 [00:00<?, ?it/s] [1703056890.511690] [072de2215b05:811 :0] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1703056890.511736] [072de2215b05:811 :1] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1703056890.511767] [072de2215b05:811 :2] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 1 was not set in ucs
[072de2215b05:811 :1:1033] Caught signal 7 (Bus error: nonexistent physical address)
[1703056890.511790] [072de2215b05:811 :0] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 1 was not set in ucs
[072de2215b05:811 :2:1032] Caught signal 7 (Bus error: nonexistent physical address)
[1703056890.511799] [072de2215b05:811 :2] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1703056890.511793] [072de2215b05:811 :1] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[072de2215b05:811 :0:1034] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

Despite these measures, the error persists whenever more than one GPU is involved. I suspect it may be related to the configuration of the Docker container or a deeper system-level issue.
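
If it helps with diagnosis: one container-level setting I understand can surface as SIGBUS is an undersized /dev/shm (Docker's default is only 64 MB unless --shm-size or --ipc=host is used). A minimal sketch of how I would check it from inside the container:

```python
# Diagnostic sketch (assumes the standard /dev/shm tmpfs mount inside the
# container): report how much shared memory the container actually has,
# since exhausting a small /dev/shm is one known way to hit a bus error.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, "
      f"used={used / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
```

I can post the output of this check along with my docker run options if that would be useful.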

I would greatly appreciate any guidance or suggestions on how to resolve this error. Thank you in advance for your time and assistance.

Best regards,