TAO SSD training error

Running TAO training with the following command.

!ssd train --gpus 1 --gpu_index $GPU_INDEX \
           -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
           -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
           -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5


2024-03-12 07:20:24,820 [TAO Toolkit] [INFO] main 356: Number of images in the training dataset: 575
2024-03-12 07:20:24,820 [TAO Toolkit] [INFO] main 358: Number of images in the validation dataset: 144
2024-03-12 07:20:25,450 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2024-03-12 07:20:29,271 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 1/10
19/36 [==============>…] - ETA: 38s - loss: 36.5114DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
2024-03-12 07:21:13,426 [TAO Toolkit] [INFO] root 2102: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond_14/SliceReplace_5/range/4975]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 586, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 582, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 562, in main
    run_experiment(config_path=args.experiment_spec_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 469, in run_experiment
    model.fit(steps_per_epoch=iters_per_epoch,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in fit
    return training_arrays.fit_loop(self, f, ins,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1470, in __call__
    ret = tf_session.TF_SessionRunCallable(self._session.session,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond_14/SliceReplace_5/range/4975]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (700):
an illegal memory access was encountered
[5bb4629a2b41:12083] *** Process received signal ***
[5bb4629a2b41:12083] Signal: Aborted (6)
[5bb4629a2b41:12083] Signal code: (-6)
[5bb4629a2b41:12083] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f1006ed5090]
[5bb4629a2b41:12083] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1006ed500b]
[5bb4629a2b41:12083] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1006eb4859]
[5bb4629a2b41:12083] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f100635e911]
[5bb4629a2b41:12083] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f100636a38c]
[5bb4629a2b41:12083] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f100636a3f7]
[5bb4629a2b41:12083] [ 6] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_core.so(+0x286b9)[0x7f0fd24556b9]
[5bb4629a2b41:12083] [ 7] /usr/local/lib/python3.8/dist-packages/nvidia/dali/python_function_plugin.cpython-38-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x46)[0x7f0fd55e1ed6]
[5bb4629a2b41:12083] [ 8] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_core.so(+0x17046)[0x7f0fd2444046]
[5bb4629a2b41:12083] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x468a7)[0x7f1006ed88a7]
[5bb4629a2b41:12083] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x7f1006ed8a60]
[5bb4629a2b41:12083] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa)[0x7f1006eb608a]
[5bb4629a2b41:12083] [12] python(_start+0x2e)[0x5faa2e]
[5bb4629a2b41:12083] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Insufficient Permissions
Execution status: FAIL

May I know which dGPU you used?
To narrow down, could you use a smaller subset of the training images and retry? Thanks.

Hi, as per your advice I trained with only 100 images, but I am getting the following error:

2024-03-15 11:50:15,995 [TAO Toolkit] [INFO] main 356: Number of images in the training dataset: 100
2024-03-15 11:50:15,996 [TAO Toolkit] [INFO] main 358: Number of images in the validation dataset: 20
2024-03-15 11:50:16,463 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2024-03-15 11:50:19,936 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 1/10
7/50 [===>…] - ETA: 1:22 - loss: 38.9288DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator RandomBBoxCrop encountered:
Error in thread 3: [/opt/dali/dali/pipeline/util/bounding_box_utils.h:165] Assert on "limits.contains(boxes[i])" failed: box {(0.15379, -3.89552e-18), (0.247275, 0.0541309)} is out of bounds {(0, 0), (1, 1)}
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x5a7182) [0x7fa8b892b182]
[frame 1]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x6827a6) [0x7fa8b8a067a6]
[frame 2]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x14dc16f) [0x7fa8b986016f]
[frame 3]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool, std::string const&)+0x1d0) [0x7faa527b89e0]
[frame 4]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x754a7f) [0x7faa52d9fa7f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7faa83584609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7faa836be133]

Current pipeline object is no longer valid.
2024-03-15 11:50:33,322 [TAO Toolkit] [INFO] root 2102: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator RandomBBoxCrop encountered:
Error in thread 3: [/opt/dali/dali/pipeline/util/bounding_box_utils.h:165] Assert on "limits.contains(boxes[i])" failed: box {(0.15379, -3.89552e-18), (0.247275, 0.0541309)} is out of bounds {(0, 0), (1, 1)}
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x5a7182) [0x7fa8b892b182]
[frame 1]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x6827a6) [0x7fa8b8a067a6]
[frame 2]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x14dc16f) [0x7fa8b986016f]
[frame 3]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool, std::string const&)+0x1d0) [0x7faa527b89e0]
[frame 4]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x754a7f) [0x7faa52d9fa7f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7faa83584609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7faa836be133]

Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond/SliceReplace_4/range/3871]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator RandomBBoxCrop encountered:
Error in thread 3: [/opt/dali/dali/pipeline/util/bounding_box_utils.h:165] Assert on "limits.contains(boxes[i])" failed: box {(0.15379, -3.89552e-18), (0.247275, 0.0541309)} is out of bounds {(0, 0), (1, 1)}
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x5a7182) [0x7fa8b892b182]
[frame 1]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x6827a6) [0x7fa8b8a067a6]
[frame 2]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x14dc16f) [0x7fa8b986016f]
[frame 3]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool, std::string const&)+0x1d0) [0x7faa527b89e0]
[frame 4]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x754a7f) [0x7faa52d9fa7f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7faa83584609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7faa836be133]

Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 586, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 582, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 562, in main
    run_experiment(config_path=args.experiment_spec_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 469, in run_experiment
    model.fit(steps_per_epoch=iters_per_epoch,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in fit
    return training_arrays.fit_loop(self, f, ins,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1470, in __call__
    ret = tf_session.TF_SessionRunCallable(self._session.session,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator RandomBBoxCrop encountered:
Error in thread 3: [/opt/dali/dali/pipeline/util/bounding_box_utils.h:165] Assert on "limits.contains(boxes[i])" failed: box {(0.15379, -3.89552e-18), (0.247275, 0.0541309)} is out of bounds {(0, 0), (1, 1)}
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x5a7182) [0x7fa8b892b182]
[frame 1]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x6827a6) [0x7fa8b8a067a6]
[frame 2]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x14dc16f) [0x7fa8b986016f]
[frame 3]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool, std::string const&)+0x1d0) [0x7faa527b89e0]
[frame 4]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x754a7f) [0x7faa52d9fa7f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7faa83584609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7faa836be133]

Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond/SliceReplace_4/range/3871]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator RandomBBoxCrop encountered:
Error in thread 3: [/opt/dali/dali/pipeline/util/bounding_box_utils.h:165] Assert on "limits.contains(boxes[i])" failed: box {(0.15379, -3.89552e-18), (0.247275, 0.0541309)} is out of bounds {(0, 0), (1, 1)}
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x5a7182) [0x7fa8b892b182]
[frame 1]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x6827a6) [0x7fa8b8a067a6]
[frame 2]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_operators.so(+0x14dc16f) [0x7fa8b986016f]
[frame 3]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool, std::string const&)+0x1d0) [0x7faa527b89e0]
[frame 4]: /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x754a7f) [0x7faa52d9fa7f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7faa83584609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7faa836be133]

Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Insufficient Permissions
Execution status: FAIL

Could you continue to use the bisect method to find which label file is the culprit?

Please suggest how to find that.

Also, please explain the bisect method, and please share any references if you have them.

For example, if there are 100 images, you can first check 50 of them. If the issue still occurs with these 50 images, split again and check 25, and so on until you isolate the culprit.
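
If it helps, below is a minimal bisection helper in Python. This is only a sketch, not an official TAO tool: it assumes a KITTI-style layout with separate image and label directories, and all paths are placeholders you would need to adjust. It copies one half of the image/label pairs into a scratch dataset, so you can point the training spec at that folder, re-run training, and keep bisecting whichever half still reproduces the error.

# Bisection helper (sketch). SRC_*/DST_* paths are assumed placeholders; adjust
# them to your dataset. Re-run training on the DST folders after each split and
# keep bisecting the half that still reproduces the DALI error.
import shutil
from pathlib import Path

SRC_IMG = Path("/workspace/tao-experiments/data/kitti_split/train/image")  # assumed path
SRC_LBL = Path("/workspace/tao-experiments/data/kitti_split/train/label")  # assumed path
DST_IMG = Path("/workspace/tao-experiments/data/bisect/image")
DST_LBL = Path("/workspace/tao-experiments/data/bisect/label")

def copy_half(first_half=True):
    """Copy one half of the image/label pairs into the scratch dataset."""
    stems = sorted(p.stem for p in SRC_LBL.glob("*.txt"))
    half = stems[: len(stems) // 2] if first_half else stems[len(stems) // 2 :]
    for d in (DST_IMG, DST_LBL):
        shutil.rmtree(d, ignore_errors=True)
        d.mkdir(parents=True, exist_ok=True)
    for stem in half:
        for ext in (".png", ".jpg", ".jpeg"):  # copy whichever image extension exists
            img = SRC_IMG / (stem + ext)
            if img.exists():
                shutil.copy(img, DST_IMG / img.name)
                break
        shutil.copy(SRC_LBL / (stem + ".txt"), DST_LBL / (stem + ".txt"))
    print("Copied %d samples to %s" % (len(half), DST_IMG.parent))

copy_half(first_half=True)  # then retrain on the bisect/ folders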

Could you share the above training spec file as well?

It is expected to use the sequence format in validation_data_sources. For SSD, tfrecords are not supported in validation_data_sources.

Please find the spec file ssd_train_resnet18_kitti.txt.

random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: false
  freeze_blocks: 0
}
training_config {
  batch_size_per_gpu: 2
  num_epochs: 10
  enable_qat: false
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.15
      annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 2
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 300
  output_height: 300
  output_channel: 3
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/ssd/tfrecords/kitti_train*"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "MathExp"
    value: "MathExp"
  }
  target_class_mapping {
    key: "Text"
    value: "Text"
  }
  target_class_mapping {
    key: "MathSym"
    value: "MathSym"
  }
  target_class_mapping {
    key: "MathOpr"
    value: "MathOpr"
  }
  target_class_mapping {
    key: "MathIL"
    value: "MathIL"
  }
  target_class_mapping {
    key: "MathText"
    value: "MathText"
  }
  target_class_mapping {
    key: "Numeric"
    value: "Numeric"
  }
  target_class_mapping {
    key: "TrgDia"
    value: "TrgDia"
  }
  target_class_mapping {
    key: "Table"
    value: "Table"
  }
  validation_data_sources: {
    image_directory_path: "/workspace/tao-experiments/data/kitti_split/val/image"
    label_directory_path: "/workspace/tao-experiments/data/kitti_split/val/label"
  }
}

I am afraid there is a negative value in one of the label files. Can you search for it?
Negative values are not expected.
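
For reference, here is a small Python scan for negative bounding-box coordinates in KITTI label files. This is only a sketch; the label directory path is an assumed placeholder. KITTI columns 5-8 hold xmin, ymin, xmax, ymax.

# Report KITTI label lines that contain a negative bounding-box coordinate (sketch).
from pathlib import Path

LABEL_DIR = Path("/workspace/tao-experiments/data/kitti_split/train/label")  # assumed path

for label_file in sorted(LABEL_DIR.glob("*.txt")):
    for line_no, line in enumerate(label_file.read_text().splitlines(), start=1):
        fields = line.split()
        if len(fields) < 8:
            continue
        xmin, ymin, xmax, ymax = map(float, fields[4:8])
        if min(xmin, ymin, xmax, ymax) < 0:
            print("%s:%d: negative coordinate -> %s" % (label_file.name, line_no, line))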

As per your advice I removed all negative values from the label files and trained on only the 100-image dataset. But I am getting the following errors.

2024-03-22 10:32:07,678 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 1/10
50/50 [==============================] - 34s 678ms/step - loss: 29.4079
Epoch 00001: saving model to /workspace/tao-experiments/ssd/experiment_dir_unpruned/weights/ssd_resnet18_epoch_001.hdf5
2024-03-22 10:32:54,341 [TAO Toolkit] [INFO] root 2102: Training loop in progress
Epoch 2/10
47/50 [===========================>…] - ETA: 1s - loss: 18.5124DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
CUDA runtime API error cudaErrorIllegalAddress (700):
an illegal memory access was encountered
Current pipeline object is no longer valid.
2024-03-22 10:33:16.683296: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:680] Non-OK-status: GpuLaunchKernel(BlockReduceKernel<IN_T, T*, num_threads, Op>, num_blocks, num_threads, 0, cu_stream, in, (T*)temp_storage.flat<int8_t>().data(), in_size, op, init) status: Internal: an illegal memory access was encountered
[5bb4629a2b41:2966884] *** Process received signal ***
[5bb4629a2b41:2966884] Signal: Aborted (6)
[5bb4629a2b41:2966884] Signal code: (-6)
[5bb4629a2b41:2966884] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f1659b5b090]
[5bb4629a2b41:2966884] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1659b5b00b]
[5bb4629a2b41:2966884] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1659b3a859]
[5bb4629a2b41:2966884] [ 3] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x212f44)[0x7f1654b00f44]
[5bb4629a2b41:2966884] [ 4] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow7functor21LaunchScalarReductionIfNS0_3SumIfEEPfS4_EEvPNS_15OpKernelContextET1_T2_iT0_T_RKP11CUstream_st+0x9db)[0x7f15657ee0fb]
[5bb4629a2b41:2966884] [ 5] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(ZN10tensorflow7functor10ReduceImplIfNS0_3SumIfEEPfS4_N5Eigen5arrayIlLm1EEEEEvPNS_15OpKernelContextET1_T2_iiiiiRKT3_T0+0x421)[0x7f15657f0e01]
[5bb4629a2b41:2966884] [ 6] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(ZN10tensorflow7functor13ReduceFunctorIN5Eigen9GpuDeviceENS2_8internal10SumReducerIfEEE6ReduceINS2_9TensorMapINS2_6TensorIfLi0ELi1ElEELi16ENS2_11MakePointerEEENS9_INSA_IKfLi1ELi1ElEELi16ESC_EENS2_5arrayIlLm1EEEEEvPNS_15OpKernelContextET_T0_RKT1_RKS6+0x1d)[0x7f15657f105d]
[5bb4629a2b41:2966884] [ 7] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow11ReductionOpIN5Eigen9GpuDeviceEfiNS1_8internal10SumReducerIfEEE7ComputeEPNS_15OpKernelContextE+0x908)[0x7f156576b648]
[5bb4629a2b41:2966884] [ 8] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3cb)[0x7f164816b21b]
[5bb4629a2b41:2966884] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113dab7)[0x7f16481c8ab7]
[5bb4629a2b41:2966884] [10] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113e11f)[0x7f16481c911f]
[5bb4629a2b41:2966884] [11] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x285)[0x7f164827d735]
[5bb4629a2b41:2966884] [12] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f164827a278]
[5bb4629a2b41:2966884] [13] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x18d9d90)[0x7f1648964d90]
[5bb4629a2b41:2966884] [14] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f1659afd609]
[5bb4629a2b41:2966884] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f1659c37133]
[5bb4629a2b41:2966884] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Insufficient Permissions
Execution status: FAIL


Could you please try the public KITTI dataset mentioned in the notebook?
If that works, could you check each label file and run "file xxx.jpg" on each image to narrow down?
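
If a Python alternative to running "file" on every image is easier, here is a sketch (the image directory is an assumed placeholder, and it requires Pillow). It reports images that cannot be decoded and images whose actual format does not match their extension.

# Check that every training image decodes, and flag extension/format mismatches (sketch).
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("/workspace/tao-experiments/data/kitti_split/train/image")  # assumed path
EXPECTED = {"jpg": "JPEG", "jpeg": "JPEG", "png": "PNG"}

for img_path in sorted(IMAGE_DIR.iterdir()):
    try:
        with Image.open(img_path) as img:
            fmt = img.format
            img.verify()  # raises if the file is truncated or corrupt
    except Exception as exc:
        print("BAD  %s: %s" % (img_path.name, exc))
        continue
    expected = EXPECTED.get(img_path.suffix.lower().lstrip("."))
    if expected and fmt != expected:
        print("WARN %s: extension %s but actual format %s" % (img_path.name, img_path.suffix, fmt))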

I have checked all the label files and they look OK. But I am again getting the following errors.

2024-04-04 07:38:27,873 [TAO Toolkit] [INFO] main 356: Number of images in the training dataset: 575
2024-04-04 07:38:27,873 [TAO Toolkit] [INFO] main 358: Number of images in the validation dataset: 144
2024-04-04 07:38:28,720 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/dssd/experiment_dir_unpruned/status.json
2024-04-04 07:38:35,271 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 1/15
2/288 […] - ETA: 40:47 - loss: 50.0902/usr/local/lib/python3.8/dist-packages/keras/callbacks.py:120: UserWarning: Method on_batch_end() is slow compared to the batch update (2.577003). Check your callbacks.
warnings.warn('Method on_batch_end() is slow compared '
4/288 […] - ETA: 20:34 - loss: 46.5984DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
CUDA runtime API error cudaErrorIllegalAddress (700):
an illegal memory access was encountered
Current pipeline object is no longer valid.
2024-04-04 07:38:54.545461: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:680] Non-OK-status: GpuLaunchKernel(BlockReduceKernel<IN_T, T*, num_threads, Op>, num_blocks, num_threads, 0, cu_stream, in, (T*)temp_storage.flat<int8_t>().data(), in_size, op, init) status: Internal: an illegal memory access was encountered
[5bb4629a2b41:3197506] *** Process received signal ***
[5bb4629a2b41:3197506] Signal: Aborted (6)
[5bb4629a2b41:3197506] Signal code: (-6)
[5bb4629a2b41:3197506] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f413efda090]
[5bb4629a2b41:3197506] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f413efda00b]
[5bb4629a2b41:3197506] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f413efb9859]
[5bb4629a2b41:3197506] [ 3] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x212f44)[0x7f4139f7ff44]
[5bb4629a2b41:3197506] [ 4] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow7functor21LaunchScalarReductionIfNS0_3SumIfEEPfS4_EEvPNS_15OpKernelContextET1_T2_iT0_T_RKP11CUstream_st+0x9db)[0x7f404ac6d0fb]
[5bb4629a2b41:3197506] [ 5] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(ZN10tensorflow7functor10ReduceImplIfNS0_3SumIfEEPfS4_N5Eigen5arrayIlLm1EEEEEvPNS_15OpKernelContextET1_T2_iiiiiRKT3_T0+0x421)[0x7f404ac6fe01]
[5bb4629a2b41:3197506] [ 6] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(ZN10tensorflow7functor13ReduceFunctorIN5Eigen9GpuDeviceENS2_8internal10SumReducerIfEEE6ReduceINS2_9TensorMapINS2_6TensorIfLi0ELi1ElEELi16ENS2_11MakePointerEEENS9_INSA_IKfLi1ELi1ElEELi16ESC_EENS2_5arrayIlLm1EEEEEvPNS_15OpKernelContextET_T0_RKT1_RKS6+0x1d)[0x7f404ac7005d]
[5bb4629a2b41:3197506] [ 7] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow11ReductionOpIN5Eigen9GpuDeviceEfiNS1_8internal10SumReducerIfEEE7ComputeEPNS_15OpKernelContextE+0x908)[0x7f404abea648]
[5bb4629a2b41:3197506] [ 8] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3cb)[0x7f41025d821b]
[5bb4629a2b41:3197506] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113dab7)[0x7f4102635ab7]
[5bb4629a2b41:3197506] [10] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113e11f)[0x7f410263611f]
[5bb4629a2b41:3197506] [11] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x285)[0x7f41026ea735]
[5bb4629a2b41:3197506] [12] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f41026e7278]
[5bb4629a2b41:3197506] [13] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x18d9d90)[0x7f4102dd1d90]
[5bb4629a2b41:3197506] [14] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f413ef7c609]
[5bb4629a2b41:3197506] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f413f0b6133]
[5bb4629a2b41:3197506] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Insufficient Permissions
Execution status: FAIL

Please help, I have not been able to get this running despite many attempts.

Could you share the output of the following commands?
$ nvidia-smi
$ dpkg -l |grep cuda

May I know if you are running on a local dGPU machine or a cloud machine (for example, Google Colab)?

Could you share the label files?
Also, if possible, could you share several images as well?
I can then check whether I can reproduce the issue.

I am running on a local dGPU machine.

Please find the details.




Please find the label files below.
class10BC_cmater_14.txt (2.2 KB)
class10BC_cmater_113.txt (1.9 KB)
class10BC_cmater_105.txt (1.7 KB)
class10BC_cmater_1.txt (2.1 KB)

Thanks for the files. Which TAO version did you use?
Please run
! tao info --verbose

The NVIDIA driver is a little old if you use TAO 5.

Configuration of the TAO Toolkit Instance

task_group:
model:
dockers:
nvidia/tao/tao-toolkit:
5.0.0-tf2.11.0:
docker_registry: nvcr.io
tasks:
1. classification_tf2
2. efficientdet_tf2
5.0.0-tf1.15.5:
docker_registry: nvcr.io
tasks:
1. bpnet
2. classification_tf1
3. converter
4. detectnet_v2
5. dssd
6. efficientdet_tf1
7. faster_rcnn
8. fpenet
9. lprnet
10. mask_rcnn
11. multitask_classification
12. retinanet
13. ssd
14. unet
15. yolo_v3
16. yolo_v4
17. yolo_v4_tiny
5.2.0-pyt2.1.0:
docker_registry: nvcr.io
tasks:
1. action_recognition
2. centerpose
3. deformable_detr
4. dino
5. mal
6. ml_recog
7. ocdnet
8. ocrnet
9. optical_inspection
10. pointpillars
11. pose_classification
12. re_identification
13. visual_changenet
5.2.0.1-pyt1.14.0:
docker_registry: nvcr.io
tasks:
1. classification_pyt
2. segformer
dataset:
dockers:
nvidia/tao/tao-toolkit:
5.2.0-data-services:
docker_registry: nvcr.io
tasks:
1. augmentation
2. auto_label
3. annotations
4. analytics
deploy:
dockers:
nvidia/tao/tao-toolkit:
5.2.0-deploy:
docker_registry: nvcr.io
tasks:
1. visual_changenet
2. centerpose
3. classification_pyt
4. classification_tf1
5. classification_tf2
6. deformable_detr
7. detectnet_v2
8. dino
9. dssd
10. efficientdet_tf1
11. efficientdet_tf2
12. faster_rcnn
13. lprnet
14. mask_rcnn
15. ml_recog
16. multitask_classification
17. ocdnet
18. ocrnet
19. optical_inspection
20. retinanet
21. segformer
22. ssd
23. trtexec
24. unet
25. yolo_v3
26. yolo_v4
27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.2.0.1
published_date: 01/16/2024

I did not use the same settings as yours. When I set it as below, there is no error during training.

data_sources: {
  #tfrecords_path: "/workspace/tao-experiments/data/ssd/tfrecords/kitti_train*"
  image_directory_path: "/home/morganh/demo_3.0/forum_repro/ssd_285706/data/image"
  label_directory_path: "/home/morganh/demo_3.0/forum_repro/ssd_285706/data/label"
}

My log as attached.
20240410_forum_285706.txt (50.8 KB)

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Also, I found an issue in the label file.
For the first txt file, why are some values larger than the image's resolution (1920x2218)?

mathexp 0.0 0 0.0 811.44 2591.12 980.19 2713.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mathsym 0.0 0 0.0 673.94 2397.38 773.94 2463.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mathsym 0.0 0 0.0 755.19 2528.62 855.19 2584.88 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mathsym 0.0 0 0.0 733.31 2613.00 814.56 2691.12 0.0 0.0 0.0 0.0 0.0 0.0 0.0
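
A quick cross-check of the label coordinates against the actual image size would catch entries like the ones above. This is only a sketch; the directory paths are assumed placeholders and it requires Pillow.

# Flag KITTI boxes that fall outside the image or are degenerate (sketch).
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("/workspace/tao-experiments/data/kitti_split/train/image")  # assumed path
LABEL_DIR = Path("/workspace/tao-experiments/data/kitti_split/train/label")  # assumed path

for label_file in sorted(LABEL_DIR.glob("*.txt")):
    images = list(IMAGE_DIR.glob(label_file.stem + ".*"))
    if not images:
        print("%s: no matching image found" % label_file.name)
        continue
    with Image.open(images[0]) as img:
        width, height = img.size
    for line in label_file.read_text().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        xmin, ymin, xmax, ymax = map(float, fields[4:8])
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height or xmin >= xmax or ymin >= ymax:
            print("%s: box (%s, %s, %s, %s) invalid for %dx%d image" %
                  (label_file.name, xmin, ymin, xmax, ymax, width, height))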

BTW, you can also train with YOLOv4 to get better mAP.