An error occurred while training with TLT

I used TLT 3.0 Faster-RCNN to train on a custom dataset, and the following error occurred:

Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/scripts/train.py", line 74, in
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/scripts/train.py", line 66, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/models/utils.py", line 407, in build_or_resume_model
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/data_loader/inputs_loader.py", line 78, in __init__
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py", line 653, in get_dataset_tensors
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/trainers/multi_task_trainer/data_loader_interface.py", line 77, in __call__
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/data_loader.py", line 396, in __call__
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2081, in apply
return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 1422, in apply
dataset = transformation_func(self)
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py", line 371, in
File "/opt/nvidia/third_party/keras/tensorflow_backend.py", line 356, in new_map
self, _map_func_set_random_wrapper, num_parallel_calls=num_parallel_calls
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2000, in map
MapDataset(self, map_func, preserve_cardinality=False))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 3531, in __init__
use_legacy_function=use_legacy_function)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2810, in __init__
self._function = wrapper_fn._get_concrete_function_internal()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1853, in _get_concrete_function_internal
*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1847, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2147, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2038, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2804, in wrapper_fn
ret = _wrapper_helper(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2749, in _wrapper_helper
ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
raise e.ag_error_metadata.to_exception(e)
StopIteration: in converted code:

/opt/nvidia/third_party/keras/tensorflow_backend.py:353 _map_func_set_random_wrapper  *
    return map_func(*args, **kwargs)
/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py:136 __call__
    
/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py:104 _get_parse_example
    
/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/utilities.py:217 extract_tfrecords_features
    

StopIteration: 

Traceback (most recent call last):
File "/usr/local/bin/faster_rcnn", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/entrypoint/faster_rcnn.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

Any idea? Thanks.

Please check your tfrecords files.
Reference: Search results for 'extract_tfrecords_features' - NVIDIA Developer Forums
Related thread: An error occurred when running TLT training

Some of the tfrecords files are empty.

I am afraid you have too few images, so zero-size tfrecords files were generated.
Please remove the empty tfrecords files.
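
As a rough sketch of that cleanup step (not part of TLT itself), the helper below lists the zero-byte files in a folder and deletes them only when asked to. The function name `remove_empty_tfrecords` and the `dry_run` flag are made up for illustration, and it assumes all tfrecords shards sit directly in one folder.

```python
from pathlib import Path


def remove_empty_tfrecords(folder, dry_run=True):
    """Return the zero-byte files in `folder`; delete them unless dry_run."""
    # Hypothetical helper: only looks at files directly inside `folder`.
    empty = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )
    if not dry_run:
        for path in empty:
            path.unlink()
    return empty
```

Calling it first with the default `dry_run=True` shows which files would be removed before anything is actually deleted.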

I deleted the empty tfrecords files, but the same error still occurs.

Can you share the log when you generate tfrecords files?

Sure:

Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2021-03-25 08:38:43,086 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2021-03-25 08:38:43,087 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 27 Val: 4
2021-03-25 08:38:43,087 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2021-03-25 08:38:43,087 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2021-03-25 08:38:43,087 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2021-03-25 08:38:43,087 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2021-03-25 08:38:43,087 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2021-03-25 08:38:43,087 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2021-03-25 08:38:43,088 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2021-03-25 08:38:43,088 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2021-03-25 08:38:43,088 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2021-03-25 08:38:43,088 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2021-03-25 08:38:43,088 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2021-03-25 08:38:43,088 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2021-03-25 08:38:43,100 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'fence': 5

2021-03-25 08:38:43,100 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2021-03-25 08:38:43,102 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2021-03-25 08:38:43,104 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2021-03-25 08:38:43,106 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2021-03-25 08:38:43,108 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2021-03-25 08:38:43,111 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2021-03-25 08:38:43,113 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2021-03-25 08:38:43,115 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2021-03-25 08:38:43,117 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2021-03-25 08:38:43,120 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2021-03-25 08:38:43,128 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'fence': 27

2021-03-25 08:38:43,129 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2021-03-25 08:38:43,129 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'fence': 32

2021-03-25 08:38:43,129 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b'fence': b'fence'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2021-03-25 08:38:43,129 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

Please refer to "ZeroDivisionError when training peoplenet" (post #10 by Morganh).

You have only 4 val images, which is fewer than num_shards.
Please add more val images or set a smaller num_shards.

val_images is (val_split)% of the total images, and train_images is (100 - val_split)% of the total images.

Please make sure both of the following hold at the same time:

  1. val_images >= num_shards
  2. train_images >= num_shards
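
The two conditions above can be checked before converting the dataset. This is only a sketch: `shard_problems` is a made-up helper name, and the counts in the example are the "Train: 27 Val: 4" figures from the conversion log in this thread. It checks nothing beyond the two stated conditions.

```python
def shard_problems(train_images, val_images, num_shards):
    """List which of the two conditions fail for the given counts."""
    problems = []
    if val_images < num_shards:
        problems.append(
            f"val_images ({val_images}) < num_shards ({num_shards})")
    if train_images < num_shards:
        problems.append(
            f"train_images ({train_images}) < num_shards ({num_shards})")
    return problems


# Counts from the conversion log: Train: 27, Val: 4.
print(shard_problems(27, 4, 10))  # ['val_images (4) < num_shards (10)']
print(shard_problems(27, 4, 1))   # []
```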

I set num_shards=1, but the same error still occurs.

Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2021-03-25 09:38:31,280 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2021-03-25 09:38:31,281 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 27 Val: 4
2021-03-25 09:38:31,281 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2021-03-25 09:38:31,281 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2021-03-25 09:38:31,281 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2021-03-25 09:38:31,292 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'fence': 5

2021-03-25 09:38:31,292 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2021-03-25 09:38:31,319 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'fence': 27

2021-03-25 09:38:31,319 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2021-03-25 09:38:31,319 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'fence': 32

2021-03-25 09:38:31,319 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b'fence': b'fence'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2021-03-25 09:38:31,319 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

total 104
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 0 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 2455 Mar 25 08:38 kitti_trainval-fold-000-of-002-shard-00009-of-00010
-rw-r--r-- 1 root root 1198 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 1196 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 1197 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 1197 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 1198 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 1198 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 1197 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 1198 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 1197 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 5389 Mar 25 08:38 kitti_trainval-fold-001-of-002-shard-00009-of-00010
-rw-r--r-- 1 root root 0 Mar 25 09:26 kitti_trainval-fold-000-of-002-shard-00000-of-00004
-rw-r--r-- 1 root root 0 Mar 25 09:26 kitti_trainval-fold-000-of-002-shard-00002-of-00004
-rw-r--r-- 1 root root 0 Mar 25 09:26 kitti_trainval-fold-000-of-002-shard-00001-of-00004
-rw-r--r-- 1 root root 1797 Mar 25 09:26 kitti_trainval-fold-000-of-002-shard-00003-of-00004
-rw-r--r-- 1 root root 4191 Mar 25 09:26 kitti_trainval-fold-001-of-002-shard-00000-of-00004
-rw-r--r-- 1 root root 4249 Mar 25 09:26 kitti_trainval-fold-001-of-002-shard-00001-of-00004
-rw-r--r-- 1 root root 4192 Mar 25 09:26 kitti_trainval-fold-001-of-002-shard-00002-of-00004
-rw-r--r-- 1 root root 4191 Mar 25 09:26 kitti_trainval-fold-001-of-002-shard-00003-of-00004
-rw-r--r-- 1 root root 2455 Mar 25 09:38 kitti_trainval-fold-000-of-002-shard-00000-of-00001
-rw-r--r-- 1 root root 16165 Mar 25 09:38 kitti_trainval-fold-001-of-002-shard-00000-of-00001

Please delete the tfrecords folder and generate the tfrecords again, because your current folder contains many zero-size tfrecords files.
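
To see why a clean folder matters, the listing above mixes shards from three conversion runs (`of-00010`, `of-00004`, `of-00001`). The sketch below groups files by that suffix and counts the empty ones. `summarize_shards` is a hypothetical helper, and the regular expression assumes the `...-fold-NNN-of-NNN-shard-NNNNN-of-NNNNN` naming seen in the listing.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches names like kitti_trainval-fold-000-of-002-shard-00008-of-00010.
SHARD_RE = re.compile(r"-fold-(\d+)-of-(\d+)-shard-(\d+)-of-(\d+)$")


def summarize_shards(folder):
    """Map each shards-per-fold value to (file count, empty file count)."""
    summary = defaultdict(lambda: [0, 0])
    for path in Path(folder).iterdir():
        match = SHARD_RE.search(path.name)
        if match is None or not path.is_file():
            continue  # not a shard file in the assumed naming scheme
        total_shards = int(match.group(4))
        summary[total_shards][0] += 1
        if path.stat().st_size == 0:
            summary[total_shards][1] += 1
    return {key: tuple(value) for key, value in sorted(summary.items())}
```

More than one key in the result suggests that outputs from several runs share the folder, and a nonzero second number means that run left empty shards behind.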

Done, thank you for your reply.