DataLossError: corrupted record at 0 when using resnet18

I am using the Docker image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3.
When I try to train a resnet18 model using the second Jupyter notebook, a DataLossError appears.

The TFRecords conversion also reports errors:

2022-06-19 10:21:40,108 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Tfrecords generation complete.
2022-06-19 10:21:40,108 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing the log_warning.json
2022-06-19 10:21:40,108 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: There were errors in the labels. Details are logged at /workspace/tfrecords/tfrecords_waring.json
2022-06-19 10:21:40,928 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you run the command below successfully?
! tao detectnet_v2 train -h

No DataLoss Error appears and then training stopped

Can you share the latest log and command?

1- Download the dataset from: widerperson kitti | Kaggle
2- Write the TFRecords conversion spec:
tfrecord_spec = """
kitti_config {
root_directory_path: "/workspace/dataset"
image_dir_name: "images"
label_dir_name: "labels"
image_extension: ".jpg"
partition_mode: "random"
num_partitions: 2
val_split: 14
num_shards: 10
}
image_directory_path: "/workspace/dataset"
"""
3- Convert the dataset:
!tao detectnet_v2 dataset_convert \
-d /workspace/specs/detectnet_v2_tfrecords_kitti_trainval.txt \
-o /workspace/tfrecords/tfrecords

4- Write the training spec:
training_spec = """
random_seed: 42

dataset_config {
data_sources {
tfrecords_path: "/workspace/tfrecords/*"
image_directory_path: "/workspace/dataset"
}
image_extension: "jpg"
target_class_mapping {
key: "person"
value: "person"
}
validation_fold: 0
}

augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}

postprocessing_config {
target_class_config {
key: "person"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.9
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
}

model_config {
pretrained_model_file: "/workspace/weights/pretrained/pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
arch: "resnet"
}

evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 30

minimum_detection_ground_truth_overlap {
key: "person"
value: 0.699999988079
}

evaluation_box_config {
key: "person"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}

average_precision_mode: INTEGRATE
}

cost_function_config {
target_classes {
name: "person"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}

training_config {
batch_size_per_gpu: 4
num_epochs: 120
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-06
max_learning_rate: 5e-04
soft_start: 0.10000000149
annealing: 0.699999988079
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}

bbox_rasterizer_config {
target_class_config {
key: "person"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}

deadzone_radius: 0.400000154972
}

"""

with open("specs/detectnet_v2_train_resnet18_kitti.txt", "w") as f:
    f.write(training_spec)

5- Train the model:
!tao detectnet_v2 train -e /workspace/specs/detectnet_v2_train_resnet18_kitti.txt \
-r weights/trained_temp \
-k $API_KEY \
-n resnet18_detector \
--gpus 1

train log output:
2022-06-19 12:16:54,628 [INFO] root: Registry: [‘nvcr.io’]
2022-06-19 12:16:54,742 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
2022-06-19 12:16:54,812 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/root/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:43: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.

2022-06-19 12:17:01,415 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:43: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-06-19 12:17:01,533 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-06-19 12:17:02,056 [INFO] root: Starting DetectNet_v2 Training job
2022-06-19 12:17:02,056 [INFO] main: Loading experiment spec at /workspace/specs/detectnet_v2_train_resnet18_kitti.txt.
2022-06-19 12:17:02,057 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/specs/detectnet_v2_train_resnet18_kitti.txt
2022-06-19 12:17:02,060 [INFO] root: Training gridbox model.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2022-06-19 12:17:02,060 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 917, in
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 906, in
File “”, line 2, in main
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 893, in main
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 757, in run_experiment
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 626, in train_gridbox
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/build_dataloader.py”, line 273, in build_dataloader
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 491, in __init__
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 548, in _construct_data_sources
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 395, in __init__
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 395, in
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 394, in
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/lib/io/tf_record.py”, line 181, in tf_record_iterator
reader.GetNext()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py”, line 1034, in GetNext
return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
2022-06-19 12:17:03,083 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
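
A "corrupted record at 0" error means the reader failed on the very first record of some file, so it can help to check whether every file the dataloader opens really is a TFRecord file. The following is a minimal sketch, assuming the standard TFRecord framing (8-byte little-endian payload length, 4-byte CRC of the length, payload, 4-byte CRC of the payload). It is only a heuristic: it does not validate the CRCs, and the helper names `looks_like_tfrecord` and `find_suspect_files` are hypothetical, not part of TAO or TensorFlow.

```python
import os
import struct


def looks_like_tfrecord(path):
    """Heuristic check of the first record's framing (CRCs are NOT verified)."""
    size = os.path.getsize(path)
    if size == 0:
        return True  # an empty file simply contains zero records
    with open(path, "rb") as f:
        header = f.read(12)  # 8-byte length + 4-byte CRC of the length
    if len(header) < 12:
        return False
    (length,) = struct.unpack("<Q", header[:8])
    # First record = 12 header bytes + payload + 4-byte payload CRC.
    return 12 + length + 4 <= size


def find_suspect_files(directory):
    """Return names of regular files whose first record cannot fit in the file."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not looks_like_tfrecord(path):
            suspects.append(name)
    return suspects
```

Run inside the container, `find_suspect_files("/workspace/tfrecords")` would list any file in that directory whose first bytes cannot be a valid record header, for example a text or JSON file.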

Can you run the command below and share the log?
! tao detectnet_v2 run ls -rltsh /workspace/tfrecords/tfrecords

Command: ! tao detectnet_v2 run ls -rltsh /workspace/tfrecords/tfrecords

output
2022-06-20 06:47:28,321 [INFO] root: Registry: [‘nvcr.io’]
2022-06-20 06:47:28,434 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
2022-06-20 06:47:28,502 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/root/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
ls: cannot access ‘/workspace/tfrecords/tfrecords’: No such file or directory
2022-06-20 06:47:30,267 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Command: ! tao detectnet_v2 run ls -rltsh /workspace/tfrecords/

output
2022-06-20 06:48:33,683 [INFO] root: Registry: [‘nvcr.io’]
2022-06-20 06:48:33,797 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
2022-06-20 06:48:33,865 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/root/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
total 16M
224K -rw-r--r-- 1 root root 224K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00000-of-00010
236K -rw-r--r-- 1 root root 233K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00001-of-00010
212K -rw-r--r-- 1 root root 212K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00002-of-00010
212K -rw-r--r-- 1 root root 211K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00003-of-00010
220K -rw-r--r-- 1 root root 219K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00004-of-00010
232K -rw-r--r-- 1 root root 232K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00005-of-00010
244K -rw-r--r-- 1 root root 242K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00006-of-00010
208K -rw-r--r-- 1 root root 206K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00007-of-00010
192K -rw-r--r-- 1 root root 192K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00008-of-00010
236K -rw-r--r-- 1 root root 234K Jun 19 10:50 tfrecords-fold-000-of-002-shard-00009-of-00010
1.4M -rw-r--r-- 1 root root 1.4M Jun 19 10:50 tfrecords-fold-001-of-002-shard-00000-of-00010
1.4M -rw-r--r-- 1 root root 1.4M Jun 19 10:50 tfrecords-fold-001-of-002-shard-00001-of-00010
1.3M -rw-r--r-- 1 root root 1.3M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00002-of-00010
1.4M -rw-r--r-- 1 root root 1.4M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00003-of-00010
1.4M -rw-r--r-- 1 root root 1.4M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00004-of-00010
1.4M -rw-r--r-- 1 root root 1.3M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00005-of-00010
1.3M -rw-r--r-- 1 root root 1.3M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00006-of-00010
1.3M -rw-r--r-- 1 root root 1.3M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00007-of-00010
1.4M -rw-r--r-- 1 root root 1.4M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00008-of-00010
4.0K -rw-r--r-- 1 root root 287 Jun 19 10:51 tfrecords_warning.json
1.4M -rw-r--r-- 1 root root 1.4M Jun 19 10:51 tfrecords-fold-001-of-002-shard-00009-of-00010
2022-06-20 06:48:35,875 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please cat the file below and share the log. Thanks.
! tao detectnet_v2 run cat /workspace/tfrecords/tfrecords_warning.json

Command: ! tao detectnet_v2 run cat /workspace/tfrecords/tfrecords_warning.json

output:

2022-06-20 06:55:07,671 [INFO] root: Registry: [‘nvcr.io’]
2022-06-20 06:55:07,786 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
2022-06-20 06:55:07,855 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/root/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
{
“/workspace/dataset/labels/011765_544x960.txt_13”: [
960,
164,
960,
335
],
“/workspace/dataset/labels/011765_544x960.txt_14”: [
960,
163,
960,
364
],
“/workspace/dataset/labels/011765_544x960.txt_15”: [
960,
164,
960,
347
]
}2022-06-20 06:55:09,592 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
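
Each entry in the warning file above has the same first and third value (960). If those four numbers are the box corners in KITTI order (left, top, right, bottom), which is an assumption and is not stated in the log, the flagged boxes have zero width at the right edge of a 960-pixel-wide image. The sketch below scans KITTI label text for such boxes. The function name `degenerate_boxes`, the sample label lines, and the 960x544 image size are illustrative assumptions.

```python
def degenerate_boxes(label_text, image_width, image_height):
    """Return (line_index, box) for boxes with no area or lying outside the image.

    Assumes KITTI object lines, where fields 4..7 are left, top, right, bottom.
    """
    bad = []
    for index, line in enumerate(label_text.splitlines()):
        fields = line.split()
        if len(fields) < 8:
            continue  # not a complete KITTI object line
        x1, y1, x2, y2 = (float(v) for v in fields[4:8])
        inside = (
            0 <= x1 <= image_width
            and 0 <= x2 <= image_width
            and 0 <= y1 <= image_height
            and 0 <= y2 <= image_height
        )
        if x2 <= x1 or y2 <= y1 or not inside:
            bad.append((index, (x1, y1, x2, y2)))
    return bad


# Hypothetical label content: the second line mirrors one warning entry.
sample = (
    "person 0 0 0 100 50 200 300 0 0 0 0 0 0 0\n"
    "person 0 0 0 960 164 960 335 0 0 0 0 0 0 0\n"
)
print(degenerate_boxes(sample, image_width=960, image_height=544))
```

Boxes reported this way can be removed or corrected in the label files before rerunning the conversion.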

Please try the following first. Delete the warning file:
! tao detectnet_v2 run rm /workspace/tfrecords/tfrecords_warning.json

Then run training again:
! tao detectnet_v2 train xxx

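Alternatively, change tfrecords_path in the training spec so that it matches only the TFRecord shards. The current pattern /workspace/tfrecords/* also matches tfrecords_warning.json, which is not a TFRecord file and is the likely cause of the DataLossError:
tfrecords_path: "/workspace/tfrecords/tfrecords-*"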
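
To see why the narrower pattern helps, the sketch below applies both patterns to a few file names copied from the directory listing earlier in this thread. It assumes the dataloader expands tfrecords_path with ordinary shell-style globbing, which Python's `fnmatch` approximates. The helper name `matched` is hypothetical.

```python
from fnmatch import fnmatch

# A few names taken from the `ls` output of /workspace/tfrecords/.
listing = [
    "tfrecords-fold-000-of-002-shard-00000-of-00010",
    "tfrecords-fold-001-of-002-shard-00009-of-00010",
    "tfrecords_warning.json",
]


def matched(pattern, names):
    """Return the names that a shell-style pattern would select."""
    return [name for name in names if fnmatch(name, pattern)]


print(matched("*", listing))            # includes tfrecords_warning.json
print(matched("tfrecords-*", listing))  # only the shard files
```

The shard names start with "tfrecords-" while the warning file starts with "tfrecords_", so the hyphen in the pattern is what excludes the JSON file.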

There has been no update from you for a while, so we assume this is no longer an issue.
We are closing this topic. If you need further support, please open a new one.
Thanks
