Error: CUDA failure 'an illegal memory access was encountered'

antenuccialessio · November 30, 2021, 5:09pm

Please provide the following information when requesting support.

• Lambda workstation, GeForce RTX 2080 Ti
• Network Type: Yolov4
• nvidia/tao/tao-toolkit-pyt: v3.21.11-py3
• Spec file for yolov4

random_seed: 42
yolov4_config {
small_anchor_shape: "[(5.35, 3.90),(8.02, 8.41),(14.26, 5.49)]"
mid_anchor_shape: "[(15.15, 12.55),(28.45, 8.28),(25.85, 19.98)]"
big_anchor_shape: "[(48.05, 13.53),(57.05, 26.31),(112.87, 42.79)]"
box_matching_iou: 0.25
matching_neutral_box_iou: 0.5
arch: "resnet"
nlayers: 18
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 4
num_epochs: 300
enable_qat: false
checkpoint_interval: 5
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
#resume_model_path: "/workspace/tao-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_070.tlt"
pretrain_model_path: "/workspace/tao-experiments/yolo_v4/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 4
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
force_on_cpu: false
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0.5
horizontal_flip: 0.5
jitter: 0.3
output_width: 1248
output_height: 384
output_channel: 3
randomize_input_shape_period: 0
mosaic_prob: 0.1
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
image_directory_path: "/workspace/tao-experiments/data/training"
}
include_difficult_in_training: true
image_extension: "png"
target_class_mapping {
key: "vehicle"
value: "vehicle"
}
validation_data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
image_directory_path: "/workspace/tao-experiments/data/val"
}
}

Hello everyone,
I am trying to train a YOLOv4 model on a Lambda workstation with a GeForce RTX 2080 Ti.
The workstation has 128 GB of RAM.
When I start training, I get the following error during model evaluation. How can I solve it?
fcace4bde730: 72: 122 [0] init.cc:951 NCCL WARN Cuda failure 'an illegal memory access was encountered'

r.rovella91 · November 30, 2021, 5:38pm

I also have this problem -_-

kayccc · November 30, 2021, 10:57pm

Thanks for the update. Could you please share how it was resolved?

Morganh · December 1, 2021, 1:00am

@antenuccialessio
Please share the full log. Thanks.

Morganh · February 16, 2022, 6:04am

Unfortunately, version 3.21.11 has an issue with "yolo_v4 evaluate".
Please switch the validation data source to sequence format, as below.

validation_data_sources: {
label_directory_path: "xxx"
image_directory_path: "xxx"
}
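
For anyone who wants to apply this change to an existing spec file without editing it by hand, here is a minimal Python sketch. It is not part of TAO: the function name to_sequence_format and the label directory used in the demo are hypothetical, and it assumes the validation_data_sources block contains no nested braces, as in the spec posted above.

```python
import re


def to_sequence_format(spec_text: str, label_dir: str) -> str:
    """Swap tfrecords_path for label_directory_path, but only inside
    validation_data_sources blocks. Hypothetical helper, not a TAO API."""
    # Assumes the block has no nested braces, so the first "}" closes it.
    block = re.compile(r'(validation_data_sources\s*:?\s*\{)(.*?)(\})', re.DOTALL)

    def fix(match):
        body = re.sub(
            r'tfrecords_path\s*:\s*"[^"]*"',
            # A callable replacement avoids backslash escapes in label_dir.
            lambda _: f'label_directory_path: "{label_dir}"',
            match.group(2),
        )
        return match.group(1) + body + match.group(3)

    return block.sub(fix, spec_text)


# Demo with placeholder paths (assumptions, not the poster's real layout).
demo_spec = '''dataset_config {
  data_sources: {
    tfrecords_path: "/data/train/tfrecords/train*"
    image_directory_path: "/data/train"
  }
  validation_data_sources: {
    tfrecords_path: "/data/val/tfrecords/val*"
    image_directory_path: "/data/val"
  }
}'''
print(to_sequence_format(demo_spec, "/data/val/label"))
```

The training data_sources block is left untouched, since only the validation source needs to change for this workaround.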

Morganh · February 16, 2022, 8:50am

For training or evaluation on TFRecord files, please set the following in nms_config:
force_on_cpu: True

Setting it to True forces NMS to run on the CPU during training. This is useful when a TFRecord dataset is used for validation during training, since there is a known issue with TensorFlow NMS on GPU in that case. Note that this flag has no impact on TAO export or on TensorRT/DeepStream inference.

See more in
https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/yolo_v4.html#nms-config
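
If you prefer to patch the spec file with a script, the following Python sketch sets the flag inside nms_config. It is only an illustration: force_nms_on_cpu is a hypothetical helper, not a TAO function, and it assumes nms_config has no nested braces. It writes the value in lowercase, matching the other booleans in the spec posted above.

```python
import re


def force_nms_on_cpu(spec_text: str) -> str:
    """Set force_on_cpu to true inside nms_config, adding the field if it
    is missing. Hypothetical helper, not a TAO API."""
    # Assumes nms_config has no nested braces, so the first "}" closes it.
    block = re.compile(r'(nms_config\s*\{)(.*?)(\})', re.DOTALL)

    def fix(match):
        body = match.group(2)
        if re.search(r'force_on_cpu\s*:', body):
            # Field already present: overwrite whatever value it has.
            body = re.sub(r'force_on_cpu\s*:\s*\w+', 'force_on_cpu: true', body)
        else:
            # Field absent: append it as the last entry of the block.
            body = body.rstrip() + '\n  force_on_cpu: true\n'
        return match.group(1) + body + match.group(3)

    return block.sub(fix, spec_text)


demo_spec = '''nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: false
  top_k: 200
}'''
print(force_nms_on_cpu(demo_spec))
```

The rest of the spec is returned unchanged, so the result can be written back to the same file and used for training as before.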
