Hello,
While training SSD with TAO in JupyterLab, I ran into the following error. Could you help me analyze the cause? Thanks!
GPU model: GeForce RTX 3070 Laptop GPU
ssd_retrain_resnet18_kitti:
random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 80
  enable_qat: false
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.1
      annealing: 0.6
    }
  }
  regularizer {
    type: NO_REG
    weight: 3e-9
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 300
  output_height: 300
  output_channel: 3
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_train*"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "cyclist"
    value: "cyclist"
  }
  target_class_mapping {
    key: "van"
    value: "car"
  }
  target_class_mapping {
    key: "person_sitting"
    value: "pedestrian"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/val/label"
    image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}
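For context, the anchor settings above should yield 6 default boxes per feature-map location. A minimal sketch of the usual SSD counting rule (this is the standard SSD convention; I am assuming TAO follows it, since its docs describe `two_boxes_for_ar1` the same way):

```python
def boxes_per_location(aspect_ratios, two_boxes_for_ar1):
    # One default box per aspect ratio; when two_boxes_for_ar1 is true
    # and ratio 1.0 is present, SSD adds a second ar=1.0 box at the
    # geometric-mean scale of adjacent feature maps.
    n = len(aspect_ratios)
    if two_boxes_for_ar1 and 1.0 in aspect_ratios:
        n += 1
    return n

print(boxes_per_location([1.0, 2.0, 0.5, 3.0, 1.0 / 3.0], True))  # → 6
```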
generate_val_dataset.py:
# Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
"""Script to generate val dataset for SSD/DSSD tutorial."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
def parse_args(args=None):
    """Parse the arguments."""
    parser = argparse.ArgumentParser(description='Generate val dataset for SSD/DSSD tutorial')
    parser.add_argument(
        "--input_image_dir",
        type=str,
        required=True,
        help="Input directory to KITTI training dataset images."
    )
    parser.add_argument(
        "--input_label_dir",
        type=str,
        required=True,
        help="Input directory to KITTI training dataset labels."
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="Output directory to TLT val dataset."
    )
    parser.add_argument(
        "--val_split",
        type=int,
        required=False,
        default=10,
        help="Percentage of training dataset for generating val dataset"
    )
    return parser.parse_args(args)


def main(args=None):
    """Main function for data preparation."""
    args = parse_args(args)
    img_files = []
    for file_name in os.listdir(args.input_image_dir):
        if file_name.split(".")[-1] == "png":
            img_files.append(file_name)
    total_cnt = len(img_files)
    val_ratio = float(args.val_split) / 100.0
    val_cnt = int(total_cnt * val_ratio)
    train_cnt = total_cnt - val_cnt
    val_img_list = img_files[0:val_cnt]
    target_img_path = os.path.join(args.output_dir, "image")
    target_label_path = os.path.join(args.output_dir, "label")
    if not os.path.exists(target_img_path):
        os.makedirs(target_img_path)
    else:
        print("This script will not run as output image path already exists.")
        return
    if not os.path.exists(target_label_path):
        os.makedirs(target_label_path)
    else:
        print("This script will not run as output label path already exists.")
        return
    print("Total {} samples in KITTI training dataset".format(total_cnt))
    print("{} for train and {} for val".format(train_cnt, val_cnt))
    for img_name in val_img_list:
        # label_name = img_name.split(".")[0] + ".txt"
        label_name = ".".join(img_name.split(".")[:-1]) + ".txt"
        os.rename(os.path.join(args.input_image_dir, img_name),
                  os.path.join(target_img_path, img_name))
        os.rename(os.path.join(args.input_label_dir, label_name),
                  os.path.join(target_label_path, label_name))


if __name__ == "__main__":
    main()
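One detail in the script worth noting: the commented-out `label_name` line would truncate any filename containing extra dots, while the kept line strips only the final extension. A quick illustration:

```python
def label_name(img_name):
    # Drop only the last extension, so a name like "000001.rect.png"
    # maps to "000001.rect.txt" (the commented-out variant
    # img_name.split(".")[0] would yield "000001.txt" instead).
    return ".".join(img_name.split(".")[:-1]) + ".txt"

print(label_name("000001.png"))       # 000001.txt
print(label_name("000001.rect.png"))  # 000001.rect.txt
```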
Run TAO training:
print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao ssd train --gpus 1 --gpu_index=$GPU_INDEX \
-e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
-m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5
Error message:
2022-11-28 03:33:49,837 [INFO] iva.common.logging.logging: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2022-11-28 03:33:52,707 [INFO] root: Starting Training Loop.
Epoch 1/80
2/421 […] - ETA: 43:51 - loss: 49.4638 /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.056582). Check your callbacks.
% delta_t_median)
144/421 [=========>…] - ETA: 52s - loss: 24.9556libpng error: IDAT: CRC error
145/421 [=========>…] - ETA: 51s - loss: 24.9197DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on "decoded_image.data != nullptr" failed: Unsupported image type.
Stacktrace (9 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8e7ce) [0x7f32a1ada7ce]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x193792) [0x7f32a1bdf792]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Image::Decode()+0x34) [0x7f32a1be06e4]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb354) [0x7f32a3110354]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 7]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 8]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]
. File: /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00003-of-00010 at index 273547791
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f32a2d28ace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb57e) [0x7f32a311057e]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]
Current pipeline object is no longer valid.
2022-11-28 03:34:19,942 [INFO] root: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on "decoded_image.data != nullptr" failed: Unsupported image type.
[[{{node Dali}}]]
[[cond_13/MultiMatch/ArithmeticOptimizer/ReorderCastLikeAndValuePreserving_int32_Reshape_1/4677]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on "decoded_image.data != nullptr" failed: Unsupported image type.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 441, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 437, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 423, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 329, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on "decoded_image.data != nullptr" failed: Unsupported image type.
…
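Given the `libpng error: IDAT: CRC error` line printed just before the DALI failure, my guess is that at least one PNG in the training set is corrupted. A stdlib-only sketch to locate PNGs with bad chunk CRCs (the scan helper and its path are illustrative; point it at the images that went into the tfrecords):

```python
import os
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def bad_png_chunks(data):
    """Return chunk types whose stored CRC32 does not match the payload.

    A PNG is the 8-byte signature followed by chunks laid out as:
    4-byte big-endian length | 4-byte type | payload | 4-byte CRC,
    where the CRC covers type + payload.
    """
    if not data.startswith(PNG_SIG):
        return ["not-a-png"]
    bad, pos = [], len(PNG_SIG)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length
        if end + 4 > len(data):
            bad.append("truncated")
            break
        (crc,) = struct.unpack(">I", data[end:end + 4])
        if zlib.crc32(ctype + data[pos + 8:end]) & 0xFFFFFFFF != crc:
            bad.append(ctype.decode("ascii", "replace"))
        pos = end + 4
        if ctype == b"IEND":
            break
    return bad

def scan_dir(root):
    # Check every .png under root; returns {filename: bad_chunk_types}.
    report = {}
    for name in sorted(os.listdir(root)):
        if name.endswith(".png"):
            with open(os.path.join(root, name), "rb") as f:
                bad = bad_png_chunks(f.read())
            if bad:
                report[name] = bad
    return report
```

Any file this flags would trip libpng's IDAT CRC check, and DALI's fallback decoder would then fail exactly as in the log; re-downloading or dropping those files and regenerating the tfrecords would be the next step.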