Unable to train YOLOv4 with TAO successfully

• Hardware: GeForce GTX 1660 Ti Mobile
• Network Type: Yolo_v4

Hello, I am trying to train YOLOv4 with my custom data and the training fails with an error that gives no clear explanation. Could you please help me check it out?
Here is my training spec:
random_seed: 42
yolov4_config {
  big_anchor_shape: "[(35.10, 10.80), (40.95, 12.00), (42.90, 13.20)]"
  mid_anchor_shape: "[(48.75, 15.00), (54.60, 16.80), (60.45, 19.20)]"
  small_anchor_shape: "[(70.20, 22.20), (81.90, 27.00), (99.45, 32.40)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  visualizer {
    enabled: False
    num_images: 1
  }
  batch_size_per_gpu: 4
  num_epochs: 1
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5"
  model_ema: true
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 4
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  output_channel: 3
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train-fold*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  include_difficult_in_training: true
  image_extension: "jpg"

This is my error log:

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
/usr/lib/python3/dist-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
“class”: algorithms.Blowfish,
2023-04-27 10:01:09,963 [INFO] root: Registry: [‘nvcr.io’]
2023-04-27 10:01:10,004 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
2023-04-27 10:01:10,062 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/usr/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
Using TensorFlow backend.
2023-04-27 02:01:12.507201: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO: Log file already exists at /workspace/tao-experiments/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 12, io threads: 24, compute threads: 12, buffered batches: -1
INFO: total dataset size 361, number of sources: 1, batch size per gpu: 10, steps: 37
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd8b814c898>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd8b814c898>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd8b814c898>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd8b814c898>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fd8b06cabe0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fd8b06cabe0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fd8b06cabe0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fd8b06cabe0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 12, io threads: 24, compute threads: 12, buffered batches: -1
INFO: total dataset size 40, number of sources: 1, batch size per gpu: 4, steps: 10
WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd7820c5b38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd7820c5b38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd7820c5b38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7fd7820c5b38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: in converted code:

/usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:353 _map_func_set_random_wrapper  *
    return map_func(*args, **kwargs)
<frozen iva.yolo_v3.data_loader.yolo_v3_data_loader>:164 __call__
    
<frozen iva.yolo_v3.data_loader.yolo_v3_data_loader>:132 _get_parse_example
    
<frozen iva.detectnet_v2.dataloader.utilities>:217 extract_tfrecords_features
    

StopIteration: 

Traceback (most recent call last):
File “</usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.py>”, line 3, in
File “”, line 152, in
File “”, line 707, in return_func
File “”, line 695, in return_func
File “”, line 148, in main
File “”, line 133, in main
File “”, line 78, in run_experiment
File “”, line 116, in build_training_pipeline
File “”, line 320, in init
File “”, line 645, in get_dataset_tensors
File “”, line 77, in call
File “”, line 598, in call
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 2081, in apply
return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 1422, in apply
dataset = transformation_func(self)
File “”, line 405, in
File “/usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py”, line 356, in new_map
self, _map_func_set_random_wrapper, num_parallel_calls=num_parallel_calls
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 2000, in map
MapDataset(self, map_func, preserve_cardinality=False))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 3531, in init
use_legacy_function=use_legacy_function)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 2810, in init
self._function = wrapper_fn._get_concrete_function_internal()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py”, line 1853, in _get_concrete_function_internal
*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py”, line 1847, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py”, line 2147, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py”, line 2038, in _create_graph_function
capture_by_value=self._capture_by_value),
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py”, line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 2804, in wrapper_fn
ret = _wrapper_helper(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 2749, in _wrapper_helper
ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py”, line 237, in wrapper
raise e.ag_error_metadata.to_exception(e)
StopIteration: in converted code:

/usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:353 _map_func_set_random_wrapper  *
    return map_func(*args, **kwargs)
<frozen iva.yolo_v3.data_loader.yolo_v3_data_loader>:164 __call__
    
<frozen iva.yolo_v3.data_loader.yolo_v3_data_loader>:132 _get_parse_example
    
<frozen iva.detectnet_v2.dataloader.utilities>:217 extract_tfrecords_features
    

StopIteration: 

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-27 10:02:38,947 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.


This is related to your tfrecords files. Please remove any zero-size files. You can also search for the key error message in the TAO forum to find more related topics.
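
For example, a quick check could look like the sketch below; the path is taken from the dataset_config above and should be adjusted to your setup:

# List zero-byte tfrecord shards, which can make the data loader raise StopIteration
find /workspace/tao-experiments/data/training/tfrecords -name "train-fold*" -size 0 -print
# After confirming the list, delete them (or regenerate the tfrecords with tao yolo_v4 dataset_convert)
find /workspace/tao-experiments/data/training/tfrecords -name "train-fold*" -size 0 -delete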

Thanks for your reply, Morganh. I trained my model successfully, converted the .etlt to an engine, and used it for inference with TensorRT, but I am getting the following error. Can you please take a look?

inference.py:64: DeprecationWarning: Use get_tensor_shape instead.
size = trt.volume(engine.get_binding_shape(binding)) * batch_size
Traceback (most recent call last):
File "inference.py", line 217, in <module>
model_inference = YOLOv4(trt_engine_path=trt_model, new_shape=(640,480),
File "inference.py", line 31, in __init__
inputs, outputs, bindings, stream = self.allocate_buffers(self.trt_engine)
File "inference.py", line 68, in allocate_buffers
host_mem = cuda.pagelocked_empty(size, dtype)
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
[04/28/2023-02:18:39] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (invalid argument)
Segmentation fault (core dumped)

Please run "tao yolo_v4 inference" first to check whether it works.
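
For reference, a typical invocation looks roughly like the sketch below; the spec path and key are reused from this thread, the other arguments are placeholders, and the exact flags may differ between TAO versions (see tao yolo_v4 inference --help):

tao yolo_v4 inference -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti.txt \
                      -m <path to the trained model or TensorRT engine> \
                      -i <directory of test images> \
                      -o <directory for annotated output images> \
                      -k $KEY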

TAO inference works normally without any errors, but when I use TensorRT I am unable to run inference.
I used the following command from the Jupyter notebook to export the model:

!tao-deploy yolo_v4 gen_trt_engine -m $USER_EXPERIMENT_DIR/export/yolov4_resnet18_epoch_001.etlt \
    -k $KEY \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti.txt \
    --batch_size 16 \
    --min_batch_size 1 \
    --opt_batch_size 2 \
    --max_batch_size 16 \
    --data_type fp32 \
    --engine_file $USER_EXPERIMENT_DIR/export/trt.engine

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Please set a lower max_batch_size and retry.
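
For example, the export command from above could be rerun with a smaller batch range. The values below are only illustrative: inference.py appears to allocate page-locked host buffers proportional to the batch size, so a max_batch_size of 16 at 1248x384 can exhaust pinned memory on this machine.

!tao-deploy yolo_v4 gen_trt_engine -m $USER_EXPERIMENT_DIR/export/yolov4_resnet18_epoch_001.etlt \
    -k $KEY \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti.txt \
    --batch_size 1 \
    --min_batch_size 1 \
    --opt_batch_size 1 \
    --max_batch_size 2 \
    --data_type fp32 \
    --engine_file $USER_EXPERIMENT_DIR/export/trt.engine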

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.