Error occurred when using JupyterLab for data training

Hello,
While training with JupyterLab, I ran into the following error. Could you please help me analyze the cause? Thanks!
GPU model: GeForce RTX 3070 Laptop GPU
Training spec file (ssd_retrain_resnet18_kitti):

random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 80
  enable_qat: false
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 5e-5
    max_learning_rate: 2e-2
    soft_start: 0.1
    annealing: 0.6
    }
  }
  regularizer {
    type: NO_REG
    weight: 3e-9
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
    output_width: 300
    output_height: 300
    output_channel: 3
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_train*"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
  }
  target_class_mapping {
      key: "cyclist"
      value: "cyclist"
  }
  target_class_mapping {
      key: "van"
      value: "car"
  }
  target_class_mapping {
      key: "person_sitting"
      value: "pedestrian"
  }
  validation_data_sources: {
      label_directory_path: "/workspace/tao-experiments/data/val/label"
      image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}

generate_val_dataset.py:

# Copyright (c) 2017-2020, NVIDIA CORPORATION.  All rights reserved.

"""Script to generate val dataset for SSD/DSSD tutorial."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import os


def parse_args(args=None):
    """parse the arguments."""
    parser = argparse.ArgumentParser(description='Generate val dataset for SSD/DSSD tutorial')

    parser.add_argument(
        "--input_image_dir",
        type=str,
        required=True,
        help="Input directory to KITTI training dataset images."
    )

    parser.add_argument(
        "--input_label_dir",
        type=str,
        required=True,
        help="Input directory to KITTI training dataset labels."
    )

    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="Ouput directory to TLT val dataset."
    )

    parser.add_argument(
        "--val_split",
        type=int,
        required=False,
        default=10,
        help="Percentage of training dataset for generating val dataset"
    )

    return parser.parse_args(args)


def main(args=None):
    """Main function for data preparation."""

    args = parse_args(args)

    img_files = []
    for file_name in os.listdir(args.input_image_dir):
        if file_name.split(".")[-1] == "png":
            img_files.append(file_name)

    total_cnt = len(img_files)
    val_ratio = float(args.val_split) / 100.0
    val_cnt = int(total_cnt * val_ratio)
    train_cnt = total_cnt - val_cnt
    val_img_list = img_files[0:val_cnt]

    target_img_path = os.path.join(args.output_dir, "image")
    target_label_path = os.path.join(args.output_dir, "label")

    if not os.path.exists(target_img_path):
        os.makedirs(target_img_path)
    else:
        print("This script will not run as output image path already exists.")
        return

    if not os.path.exists(target_label_path):
        os.makedirs(target_label_path)
    else:
        print("This script will not run as output label path already exists.")
        return

    print("Total {} samples in KITTI training dataset".format(total_cnt))
    print("{} for train and {} for val".format(train_cnt, val_cnt))

    for img_name in val_img_list:
        #label_name = img_name.split(".")[0] + ".txt"
        label_name = ".".join(img_name.split(".")[:-1]) + ".txt"
        os.rename(os.path.join(args.input_image_dir, img_name),
                  os.path.join(target_img_path, img_name))
        os.rename(os.path.join(args.input_label_dir, label_name),
                  os.path.join(target_label_path, label_name))


if __name__ == "__main__":
    main()
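
A typical invocation of this script (the paths are only illustrative and assume the standard KITTI layout under /workspace/tao-experiments; adjust them to your setup) looks like:

python generate_val_dataset.py \
    --input_image_dir=/workspace/tao-experiments/data/training/image_2 \
    --input_label_dir=/workspace/tao-experiments/data/training/label_2 \
    --output_dir=/workspace/tao-experiments/data/val

Note that the script moves (via os.rename) the selected images and labels out of the training directories, so the validation split is no longer part of the training set.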

Run TAO training:

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao ssd train --gpus 1 --gpu_index=$GPU_INDEX \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5

Error message:
2022-11-28 03:33:49,837 [INFO] iva.common.logging.logging: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2022-11-28 03:33:52,707 [INFO] root: Starting Training Loop.
Epoch 1/80
2/421 […] - ETA: 43:51 - loss: 49.4638 /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.056582). Check your callbacks.
% delta_t_median)
144/421 [=========>…] - ETA: 52s - loss: 24.9556libpng error: IDAT: CRC error
145/421 [=========>…] - ETA: 51s - loss: 24.9197DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on “decoded_image.data != nullptr” failed: Unsupported image type.
Stacktrace (9 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8e7ce) [0x7f32a1ada7ce]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x193792) [0x7f32a1bdf792]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Image::Decode()+0x34) [0x7f32a1be06e4]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb354) [0x7f32a3110354]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 7]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 8]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]
. File: /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00003-of-00010 at index 273547791
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f32a2d28ace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb57e) [0x7f32a311057e]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]

Current pipeline object is no longer valid.
2022-11-28 03:34:19,942 [INFO] root: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on “decoded_image.data != nullptr” failed: Unsupported image type.
Stacktrace (9 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8e7ce) [0x7f32a1ada7ce]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x193792) [0x7f32a1bdf792]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Image::Decode()+0x34) [0x7f32a1be06e4]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb354) [0x7f32a3110354]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 7]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 8]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]
. File: /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00003-of-00010 at index 273547791
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f32a2d28ace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb57e) [0x7f32a311057e]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]

Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond_13/MultiMatch/ArithmeticOptimizer/ReorderCastLikeAndValuePreserving_int32_Reshape_1/4677]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on “decoded_image.data != nullptr” failed: Unsupported image type.
Stacktrace (9 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8e7ce) [0x7f32a1ada7ce]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x193792) [0x7f32a1bdf792]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Image::Decode()+0x34) [0x7f32a1be06e4]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb354) [0x7f32a3110354]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 7]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 8]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]
. File: /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00003-of-00010 at index 273547791
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f32a2d28ace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb57e) [0x7f32a311057e]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]

Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py”, line 441, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py”, line 437, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py”, line 423, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py”, line 329, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1039, in fit
validation_steps=validation_steps)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py”, line 154, in fit_loop
outs = f(ins)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2715, in call
return self._call(inputs)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2675, in _call
fetched = self.callable_fn(*array_vals)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:165] [/opt/dali/dali/image/generic_image.cc:38] Assert on “decoded_image.data != nullptr” failed: Unsupported image type.
Stacktrace (9 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8e7ce) [0x7f32a1ada7ce]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x193792) [0x7f32a1bdf792]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Image::Decode()+0x34) [0x7f32a1be06e4]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb354) [0x7f32a3110354]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 7]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 8]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3394adf293]
. File: /workspace/tao-experiments/data/tfrecords/kitti_train-fold-000-of-002-shard-00003-of-00010 at index 273547791
Stacktrace (7 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7f32a2d28ace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb57e) [0x7f32a311057e]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x7fb8fb) [0x7f32a31108fb]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f32a1bb45a7]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7f32a22ee13f]
[frame 5]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f33949a3609]
[frame 6]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)

From the log, could you double-check the training dataset?
Are there any hidden or unexpected files in it?
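
For example, a quick way to spot hidden, non-PNG, or unreadable files is a small scan like the one below (a minimal sketch; it assumes Pillow is installed and that the training images live under the directory shown, so adjust the path to your dataset):

import os
from PIL import Image

# Hypothetical path; point this at the directory that holds your training images.
image_dir = "/workspace/tao-experiments/data/training/image_2"

bad_files = []
for name in sorted(os.listdir(image_dir)):
    path = os.path.join(image_dir, name)
    # Flag hidden files and anything that is not a .png.
    if name.startswith(".") or not name.lower().endswith(".png"):
        bad_files.append(name)
        continue
    try:
        # verify() performs a lightweight integrity check without fully decoding the image.
        with Image.open(path) as img:
            img.verify()
    except Exception as exc:
        print("Unreadable image: {} ({})".format(name, exc))
        bad_files.append(name)

print("{} problematic file(s) found".format(len(bad_files)))

Any file flagged this way should be fixed or removed, and the TFRecords regenerated from the cleaned images, before training again (the error above points into a TFRecord shard, so the corrupted data has already been packed into it).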

Thank you.
The problem has been solved: some image files in the dataset were corrupted, so the image decoder could not read them during training.
