Narrow and long bounding boxes

Hello team,
I'm training a yolo_v4_tiny model to detect vehicles in pictures taken by a drone (at a height of 300 m).
I'm using the corresponding TAO notebook to train the network, and I'm getting strange results.
Some of the bounding boxes are narrow and long; see the images below.
image
image

What could be the reason for this, and how can I avoid it?
I configured the anchor boxes with the provided script, and none of the resulting anchor boxes matches the shape of these bounding boxes.
Thank you.

Please first check that the labels are correct. Then follow the YOLOv4-tiny section of the TAO Toolkit 4.0 documentation to generate the anchor shapes.
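One way to do the label check is to scan the annotation files for boxes with an extreme aspect ratio or a non-positive size. Below is a minimal sketch in Python, assuming the labels are in KITTI text format (class name first, then the box as xmin, ymin, xmax, ymax in fields 5 to 8). The function name and both thresholds are hypothetical and should be tuned to the dataset.

```python
def find_suspicious_boxes(label_lines, max_aspect=5.0, min_side=2.0):
    """Return (line_index, width, height) for boxes that look wrong.

    Assumes KITTI rows: type truncated occluded alpha xmin ymin xmax ymax ...
    """
    suspicious = []
    for idx, line in enumerate(label_lines):
        fields = line.split()
        if len(fields) < 8:
            continue  # not a complete KITTI row, skip it
        xmin, ymin, xmax, ymax = map(float, fields[4:8])
        width, height = xmax - xmin, ymax - ymin
        if width <= 0 or height <= 0:
            suspicious.append((idx, width, height))  # inverted or empty box
            continue
        aspect = max(width / height, height / width)
        if aspect > max_aspect or min(width, height) < min_side:
            suspicious.append((idx, width, height))
    return suspicious


# Made-up rows for illustration only, not taken from the dataset in this thread.
sample = [
    "car 0 0 0 100.0 50.0 130.0 67.0 0 0 0 0 0 0 0",
    "car 0 0 0 200.0 80.0 203.0 140.0 0 0 0 0 0 0 0",
    "bus 0 0 0 300.0 90.0 290.0 110.0 0 0 0 0 0 0 0",
]
flagged = find_suspicious_boxes(sample)
print(flagged)
```

Here the second row is flagged because it is 3 px wide and 60 px tall, and the third because xmax is smaller than xmin.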

Yes, the labels are correct, and I generated the anchor shapes using the TAO script (k-means).
For some reason, I'm still getting those long, narrow bounding boxes.

What should I do to improve the model so that it does not produce them?
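For context on what k-means anchor generation does, here is a minimal sketch of the general technique in plain Python. It is not the TAO script itself: the IoU-based assignment, the deterministic initialisation, and all names are assumptions made for illustration. In the usual YOLO formulation, anchors are only cluster means of the labelled (width, height) pairs, and the predicted size is a learned scaling of an anchor, so a prediction can in principle be more elongated than every anchor.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchors, largest area first."""
    # Deterministic start: take k boxes spread evenly over the area-sorted list.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    anchors = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the anchor it overlaps most.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep the anchor of an empty cluster
                continue
            mean_w = sum(b[0] for b in members) / len(members)
            mean_h = sum(b[1] for b in members) / len(members)
            new_anchors.append((mean_w, mean_h))
        if new_anchors == anchors:
            break  # assignments are stable
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1], reverse=True)


# Made-up box sizes in pixels, for illustration only.
boxes = [(10, 5), (11, 5), (30, 17), (31, 16), (5, 3), (6, 4)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```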

The long, narrow bboxes are the inference results from running "tao yolo_v4_tiny inference", right?
What do the training results look like?
Can you share the training spec file and the full training log?
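If the boxes do come from inference output, one possible stopgap while the root cause is investigated is to post-filter the detections by confidence and aspect ratio. A minimal sketch in Python, assuming each detection is a (label, confidence, xmin, ymin, xmax, ymax) tuple; the function name and both thresholds are hypothetical and not part of TAO.

```python
def filter_detections(detections, min_confidence=0.3, max_aspect=4.0):
    """Keep detections that are confident enough and not too elongated."""
    kept = []
    for det in detections:
        label, confidence, xmin, ymin, xmax, ymax = det
        width, height = xmax - xmin, ymax - ymin
        if width <= 0 or height <= 0 or confidence < min_confidence:
            continue  # degenerate box or low confidence
        if max(width / height, height / width) > max_aspect:
            continue  # longer and narrower than expected for a vehicle
        kept.append(det)
    return kept


# Made-up detections for illustration only.
detections = [
    ("car", 0.90, 100.0, 50.0, 130.0, 67.0),
    ("car", 0.60, 200.0, 10.0, 204.0, 90.0),   # 4 x 80 px, dropped by aspect ratio
    ("bus", 0.05, 10.0, 10.0, 60.0, 40.0),     # dropped by confidence
]
kept = filter_detections(detections)
print(kept)
```

This only hides the symptom; it does not change what the model has learned.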

spec file:

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(30.23, 17.06), (16.88, 6.03), (10.48, 8.59)]"
  mid_anchor_shape: "[(10.00, 4.11), (6.16, 5.88), (5.04, 3.45)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  visualizer {
      enabled: True
      num_images: 3
  }
  batch_size_per_gpu: 8
  num_epochs: 500
  enable_qat: true
  checkpoint_interval: 25
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
      image_directory_path: "/workspace/tao-experiments/data/training"
  }
  include_difficult_in_training: true
  image_extension: "png"
  target_class_mapping {
      key: "bus"
      value: "bus"
  }
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "motorcycle"
      value: "motorcycle"
  }

  validation_data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
      image_directory_path: "/workspace/tao-experiments/data/val"
  }
}
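The remark that no anchor matches the long, narrow boxes can be checked directly from the anchor strings in this spec. A small sketch in Python; the two strings are copied from the spec above, and the helper name is hypothetical. The ratio is taken as long side over short side, so it does not depend on whether each pair is read as (width, height) or the reverse.

```python
import ast

# Anchor strings copied from yolov4_config in the spec above.
big_anchor_shape = "[(30.23, 17.06), (16.88, 6.03), (10.48, 8.59)]"
mid_anchor_shape = "[(10.00, 4.11), (6.16, 5.88), (5.04, 3.45)]"

anchors = ast.literal_eval(big_anchor_shape) + ast.literal_eval(mid_anchor_shape)


def aspect_ratio(shape):
    """Long side divided by short side, so the result is always >= 1."""
    first, second = shape
    return max(first / second, second / first)


ratios = [round(aspect_ratio(a), 2) for a in anchors]
for anchor, ratio in zip(anchors, ratios):
    print(anchor, ratio)
print("max ratio:", max(ratios))
```

The largest ratio is about 2.8 (the 16.88 x 6.03 anchor), so none of the anchors is strongly elongated, which is consistent with the observation that no anchor fits the narrow boxes.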

Please share the full training log as well.

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2022-12-12 11:31:43,662 [INFO] root: Registry: ['nvcr.io']
2022-12-12 11:31:43,716 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-12-12 11:31:43,806 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/alexknish/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:42: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:42: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:8: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:8: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:8: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:8: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:9: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:9: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:55: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/data_loader/generate_shape_tensors.py:55: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING: From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 16, io threads: 32, compute threads: 16, buffered batches: -1
INFO: total dataset size 1174, number of sources: 1, batch size per gpu: 20, steps: 59
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f05f40844a8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f05f40844a8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f05f40844a8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f05f40844a8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f05bc231f98>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f05bc231f98>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f05bc231f98>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f05bc231f98>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
/opt/nvidia/third_party/keras/tensorflow_backend.py:356: UserWarning: Seed 42 from outer graph might be getting used by function Dataset_map__map_func_set_random_wrapper, if the random op has not been provided any seed. Explicitly set the seed in the function if this is not the intended behavior.
self, _map_func_set_random_wrapper, num_parallel_calls=num_parallel_calls
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py:302: UserWarning: tf.data static optimizations are not compatible with tf.Variable. The following optimizations will be disabled: map_and_batch_fusion, noop_elimination, shuffle_and_repeat_fusion. To enable optimizations, use resource variables instead by calling tf.enable_resource_variables() at the start of the program.
", ".join(static_optimizations))
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/dataio/tf_data_pipe.py:131: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/dataio/tf_data_pipe.py:131: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/visualizer/tensorboard_visualizer.py:79: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/visualizer/tensorboard_visualizer.py:79: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/losses/base_loss.py:40: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/losses/base_loss.py:40: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/visualizer/tensorboard_visualizer.py:85: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/visualizer/tensorboard_visualizer.py:85: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 16, io threads: 32, compute threads: 16, buffered batches: -1
INFO: total dataset size 470, number of sources: 1, batch size per gpu: 8, steps: 59
WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f0475a9a7b8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f0475a9a7b8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f0475a9a7b8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.call of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7f0475a9a7b8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: False - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f0475938dd8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f0475938dd8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f0475938dd8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.call of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f0475938dd8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:1123: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:1123: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json


Layer (type)                     Output Shape          Param #  Connected to
Input (InputLayer)               (8, 3, None, None)    0
Input_qdq (QDQ)                  (8, 3, None, None)    1        Input[0][0]
conv_0 (QuantizedConv2D)         (8, 32, None, None)   864      Input_qdq[0][0]
conv_0_bn (BatchNormalization)   (8, 32, None, None)   128      conv_0[0][0]
conv_0_mish (ReLU)               (8, 32, None, None)   0        conv_0_bn[0][0]
conv_0_mish_qdq (QDQ)            (8, 32, None, None)   1        conv_0_mish[0][0]
conv_1 (QuantizedConv2D)         (8, 64, None, None)   18432    conv_0_mish_qdq[0][0]
conv_1_bn (BatchNormalization)   (8, 64, None, None)   256      conv_1[0][0]
conv_1_mish (ReLU)               (8, 64, None, None)   0        conv_1_bn[0][0]
conv_1_mish_qdq (QDQ)            (8, 64, None, None)   1        conv_1_mish[0][0]
conv_2_conv_0 (QuantizedConv2D)  (8, 64, None, None)   36864    conv_1_mish_qdq[0][0]
conv_2_conv_0_bn (BatchNormaliz  (8, 64, None, None)   256      conv_2_conv_0[0][0]
conv_2_conv_0_mish (ReLU)        (8, 64, None, None)   0        conv_2_conv_0_bn[0][0]
conv_2_conv_0_mish_qdq (QDQ)     (8, 64, None, None)   1        conv_2_conv_0_mish[0][0]
conv_2_split_0 (Split)           (8, 32, None, None)   0        conv_2_conv_0_mish_qdq[0][0]
conv_2_split_0_qdq (QDQ)         (8, 32, None, None)   1        conv_2_split_0[0][0]
conv_2_conv_1 (QuantizedConv2D)  (8, 32, None, None)   9216     conv_2_split_0_qdq[0][0]
conv_2_conv_1_bn (BatchNormaliz  (8, 32, None, None)   128      conv_2_conv_1[0][0]
conv_2_conv_1_mish (ReLU)        (8, 32, None, None)   0        conv_2_conv_1_bn[0][0]
conv_2_conv_1_mish_qdq (QDQ)     (8, 32, None, None)   1        conv_2_conv_1_mish[0][0]
conv_2_conv_2 (QuantizedConv2D)  (8, 32, None, None)   9216     conv_2_conv_1_mish_qdq[0][0]
conv_2_conv_2_bn (BatchNormaliz  (8, 32, None, None)   128      conv_2_conv_2[0][0]
conv_2_conv_2_mish (ReLU)        (8, 32, None, None)   0        conv_2_conv_2_bn[0][0]
conv_2_conv_2_mish_qdq (QDQ)     (8, 32, None, None)   1        conv_2_conv_2_mish[0][0]
conv_2_concat_0 (Concatenate)    (8, 64, None, None)   0        conv_2_conv_2_mish_qdq[0][0]
                                                                conv_2_conv_1_mish_qdq[0][0]
conv_2_concat_0_qdq (QDQ)        (8, 64, None, None)   1        conv_2_concat_0[0][0]
conv_2_conv_3 (QuantizedConv2D)  (8, 64, None, None)   4096     conv_2_concat_0_qdq[0][0]
conv_2_conv_3_bn (BatchNormaliz  (8, 64, None, None)   256      conv_2_conv_3[0][0]
conv_2_conv_3_mish (ReLU)        (8, 64, None, None)   0        conv_2_conv_3_bn[0][0]
conv_2_conv_3_mish_qdq (QDQ)     (8, 64, None, None)   1        conv_2_conv_3_mish[0][0]
conv_2_concat_1 (Concatenate)    (8, 128, None, None)  0        conv_2_conv_0_mish_qdq[0][0]
                                                                conv_2_conv_3_mish_qdq[0][0]
conv_2_concat_1_qdq (QDQ)        (8, 128, None, None)  1        conv_2_concat_1[0][0]
conv_2_pool_0 (MaxPooling2D)     (8, 128, None, None)  0        conv_2_concat_1_qdq[0][0]
conv_2_pool_0_qdq (QDQ)          (8, 128, None, None)  1        conv_2_pool_0[0][0]
conv_3_conv_0 (QuantizedConv2D)  (8, 128, None, None)  147456   conv_2_pool_0_qdq[0][0]
conv_3_conv_0_bn (BatchNormaliz  (8, 128, None, None)  512      conv_3_conv_0[0][0]
conv_3_conv_0_mish (ReLU)        (8, 128, None, None)  0        conv_3_conv_0_bn[0][0]
conv_3_conv_0_mish_qdq (QDQ)     (8, 128, None, None)  1        conv_3_conv_0_mish[0][0]
conv_3_split_0 (Split)           (8, 64, None, None)   0        conv_3_conv_0_mish_qdq[0][0]
conv_3_split_0_qdq (QDQ)         (8, 64, None, None)   1        conv_3_split_0[0][0]
conv_3_conv_1 (QuantizedConv2D)  (8, 64, None, None)   36864    conv_3_split_0_qdq[0][0]
conv_3_conv_1_bn (BatchNormaliz  (8, 64, None, None)   256      conv_3_conv_1[0][0]
conv_3_conv_1_mish (ReLU)        (8, 64, None, None)   0        conv_3_conv_1_bn[0][0]
conv_3_conv_1_mish_qdq (QDQ)     (8, 64, None, None)   1        conv_3_conv_1_mish[0][0]
conv_3_conv_2 (QuantizedConv2D)  (8, 64, None, None)   36864    conv_3_conv_1_mish_qdq[0][0]
conv_3_conv_2_bn (BatchNormaliz  (8, 64, None, None)   256      conv_3_conv_2[0][0]
conv_3_conv_2_mish (ReLU)        (8, 64, None, None)   0        conv_3_conv_2_bn[0][0]
conv_3_conv_2_mish_qdq (QDQ)     (8, 64, None, None)   1        conv_3_conv_2_mish[0][0]
conv_3_concat_0 (Concatenate)    (8, 128, None, None)  0        conv_3_conv_2_mish_qdq[0][0]
                                                                conv_3_conv_1_mish_qdq[0][0]
conv_3_concat_0_qdq (QDQ)        (8, 128, None, None)  1        conv_3_concat_0[0][0]
conv_3_conv_3 (QuantizedConv2D)  (8, 128, None, None)  16384    conv_3_concat_0_qdq[0][0]
conv_3_conv_3_bn (BatchNormaliz  (8, 128, None, None)  512      conv_3_conv_3[0][0]
conv_3_conv_3_mish (ReLU)        (8, 128, None, None)  0        conv_3_conv_3_bn[0][0]
conv_3_conv_3_mish_qdq (QDQ)     (8, 128, None, None)  1        conv_3_conv_3_mish[0][0]
conv_3_concat_1 (Concatenate)    (8, 256, None, None)  0        conv_3_conv_0_mish_qdq[0][0]
                                                                conv_3_conv_3_mish_qdq[0][0]
conv_3_concat_1_qdq (QDQ)        (8, 256, None, None)  1        conv_3_concat_1[0][0]
conv_3_pool_0 (MaxPooling2D)     (8, 256, None, None)  0        conv_3_concat_1_qdq[0][0]
conv_3_pool_0_qdq (QDQ)          (8, 256, None, None)  1        conv_3_pool_0[0][0]
conv_4_conv_0 (QuantizedConv2D)  (8, 256, None, None)  589824   conv_3_pool_0_qdq[0][0]
conv_4_conv_0_bn (BatchNormaliz  (8, 256, None, None)  1024     conv_4_conv_0[0][0]
conv_4_conv_0_mish (ReLU)        (8, 256, None, None)  0        conv_4_conv_0_bn[0][0]
conv_4_conv_0_mish_qdq (QDQ)     (8, 256, None, None)  1        conv_4_conv_0_mish[0][0]
conv_4_split_0 (Split)           (8, 128, None, None)  0        conv_4_conv_0_mish_qdq[0][0]
conv_4_split_0_qdq (QDQ)         (8, 128, None, None)  1        conv_4_split_0[0][0]
conv_4_conv_1 (QuantizedConv2D)  (8, 128, None, None)  147456   conv_4_split_0_qdq[0][0]
conv_4_conv_1_bn (BatchNormaliz  (8, 128, None, None)  512      conv_4_conv_1[0][0]
conv_4_conv_1_mish (ReLU)        (8, 128, None, None)  0        conv_4_conv_1_bn[0][0]
conv_4_conv_1_mish_qdq (QDQ)     (8, 128, None, None)  1        conv_4_conv_1_mish[0][0]
conv_4_conv_2 (QuantizedConv2D)  (8, 128, None, None)  147456   conv_4_conv_1_mish_qdq[0][0]
conv_4_conv_2_bn (BatchNormaliz  (8, 128, None, None)  512      conv_4_conv_2[0][0]
conv_4_conv_2_mish (ReLU)        (8, 128, None, None)  0        conv_4_conv_2_bn[0][0]
conv_4_conv_2_mish_qdq (QDQ)     (8, 128, None, None)  1        conv_4_conv_2_mish[0][0]
conv_4_concat_0 (Concatenate)    (8, 256, None, None)  0        conv_4_conv_2_mish_qdq[0][0]
                                                                conv_4_conv_1_mish_qdq[0][0]
conv_4_concat_0_qdq (QDQ)        (8, 256, None, None)  1        conv_4_concat_0[0][0]
conv_4_conv_3 (QuantizedConv2D)  (8, 256, None, None)  65536    conv_4_concat_0_qdq[0][0]
conv_4_conv_3_bn (BatchNormaliz  (8, 256, None, None)  1024     conv_4_conv_3[0][0]
conv_4_conv_3_mish (ReLU)        (8, 256, None, None)  0        conv_4_conv_3_bn[0][0]
conv_4_conv_3_mish_qdq (QDQ)     (8, 256, None, None)  1        conv_4_conv_3_mish[0][0]
conv_4_concat_1 (Concatenate)    (8, 512, None, None)  0        conv_4_conv_0_mish_qdq[0][0]
                                                                conv_4_conv_3_mish_qdq[0][0]
conv_4_concat_1_qdq (QDQ)        (8, 512, None, None)  1        conv_4_concat_1[0][0]
conv_4_pool_0 (MaxPooling2D)     (8, 512, None, None)  0        conv_4_concat_1_qdq[0][0]
conv_4_pool_0_qdq (QDQ)          (8, 512, None, None)  1        conv_4_pool_0[0][0]
conv_5 (QuantizedConv2D)         (8, 512, None, None)  2359296  conv_4_pool_0_qdq[0][0]
conv_5_bn (BatchNormalization)   (8, 512, None, None)  2048     conv_5[0][0]
conv_5_mish (ReLU)               (8, 512, None, None)  0        conv_5_bn[0][0]
conv_5_mish_qdq (QDQ)            (8, 512, None, None)  1        conv_5_mish[0][0]
yolo_conv1_1 (QuantizedConv2D)   (8, 256, None, None)  131072   conv_5_mish_qdq[0][0]
yolo_conv1_1_bn (BatchNormaliza  (8, 256, None, None)  1024     yolo_conv1_1[0][0]
yolo_conv1_1_lrelu (ReLU)        (8, 256, None, None)  0        yolo_conv1_1_bn[0][0]
yolo_conv1_1_lrelu_qdq (QDQ)     (8, 256, None, None)  1        yolo_conv1_1_lrelu[0][0]
yolo_conv2 (QuantizedConv2D)     (8, 128, None, None)  32768    yolo_conv1_1_lrelu_qdq[0][0]
yolo_conv2_bn (BatchNormalizati  (8, 128, None, None)  512      yolo_conv2[0][0]
yolo_conv2_lrelu (ReLU)          (8, 128, None, None)  0        yolo_conv2_bn[0][0]
yolo_conv2_lrelu_qdq (QDQ)       (8, 128, None, None)  1        yolo_conv2_lrelu[0][0]
upsample0 (UpSampling2D)         (8, 128, None, None)  0        yolo_conv2_lrelu_qdq[0][0]
upsample0_qdq (QDQ)              (8, 128, None, None)  1        upsample0[0][0]
concatenate_2 (Concatenate)      (8, 384, None, None)  0        upsample0_qdq[0][0]
                                                                conv_4_conv_3_mish_qdq[0][0]
concatenate_2_qdq (QDQ)          (8, 384, None, None)  1        concatenate_2[0][0]
yolo_conv1_6 (QuantizedConv2D)   (8, 512, None, None)  1179648  yolo_conv1_1_lrelu_qdq[0][0]
yolo_conv3_6 (QuantizedConv2D)   (8, 256, None, None)  884736   concatenate_2_qdq[0][0]
yolo_conv1_6_bn (BatchNormaliza  (8, 512, None, None)  2048     yolo_conv1_6[0][0]
yolo_conv3_6_bn (BatchNormaliza  (8, 256, None, None)  1024     yolo_conv3_6[0][0]
yolo_conv1_6_lrelu (ReLU)        (8, 512, None, None)  0        yolo_conv1_6_bn[0][0]
yolo_conv3_6_lrelu (ReLU)        (8, 256, None, None)  0        yolo_conv3_6_bn[0][0]
yolo_conv1_6_lrelu_qdq (QDQ)     (8, 512, None, None)  1        yolo_conv1_6_lrelu[0][0]
yolo_conv3_6_lrelu_qdq (QDQ)     (8, 256, None, None)  1        yolo_conv3_6_lrelu[0][0]
conv_big_object (Conv2D)         (8, 33, None, None)   16929    yolo_conv1_6_lrelu_qdq[0][0]
conv_mid_object (Conv2D)         (8, 33, None, None)   8481     yolo_conv3_6_lrelu_qdq[0][0]
bg_permute (Permute)             (8, None, None, 33)   0        conv_big_object[0][0]
md_permute (Permute)             (8, None, None, 33)   0        conv_mid_object[0][0]
bg_reshape (Reshape)             (8, None, 11)         0        bg_permute[0][0]
md_reshape (Reshape)             (8, None, 11)         0        md_permute[0][0]


bg_anchor (YOLOAnchorBox) (8, None, 6) 0 conv_big_object[0][0]


bg_bbox_processor (BBoxPostProc (8, None, 11) 0 bg_reshape[0][0]


md_anchor (YOLOAnchorBox) (8, None, 6) 0 conv_mid_object[0][0]


md_bbox_processor (BBoxPostProc (8, None, 11) 0 md_reshape[0][0]


encoded_bg (Concatenate) (8, None, 17) 0 bg_anchor[0][0]
bg_bbox_processor[0][0]


encoded_md (Concatenate) (8, None, 17) 0 md_anchor[0][0]
md_bbox_processor[0][0]


encoded_detections (Concatenate (8, None, 17) 0 encoded_bg[0][0]
encoded_md[0][0]

Total params: 5,891,908
Trainable params: 5,885,666
Non-trainable params: 6,242


WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:7: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:7: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:8: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:8: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:1171: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:1171: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

INFO: Starting Training Loop.
Epoch 251/500
1/147 […] - ETA: 1:02:26 - loss: 64.7614WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2/147 […] - ETA: 41:29 - loss: 67.3439 /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (3.784871). Check your callbacks.
% delta_t_median)
147/147 [==============================] - 232s 2s/step - loss: 70.2053
897376ad6fe1:78:133 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
897376ad6fe1:78:133 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
897376ad6fe1:78:133 [0] NCCL INFO P2P plugin IBext
897376ad6fe1:78:133 [0] NCCL INFO NET/IB : No device found.
897376ad6fe1:78:133 [0] NCCL INFO NET/IB : No device found.
897376ad6fe1:78:133 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
897376ad6fe1:78:133 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
897376ad6fe1:78:133 [0] NCCL INFO Channel 00/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 01/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 02/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 03/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 04/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 05/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 06/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 07/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 08/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 09/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 10/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 11/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 12/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 13/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 14/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 15/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 16/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 17/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 18/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 19/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 20/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 21/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 22/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 23/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 24/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 25/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 26/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 27/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 28/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 29/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 30/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Channel 31/32 : 0
897376ad6fe1:78:133 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
897376ad6fe1:78:133 [0] NCCL INFO Connected all rings
897376ad6fe1:78:133 [0] NCCL INFO Connected all trees
897376ad6fe1:78:133 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
897376ad6fe1:78:133 [0] NCCL INFO comm 0x7f06187cefa0 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
INFO: Training loop in progress
Epoch 252/500
147/147 [==============================] - 204s 1s/step - loss: 61.2787
INFO: Training loop in progress
Epoch 253/500
147/147 [==============================] - 205s 1s/step - loss: 64.2056
INFO: Training loop in progress
Epoch 254/500
147/147 [==============================] - 205s 1s/step - loss: 67.1287
INFO: Training loop in progress
Epoch 255/500
147/147 [==============================] - 198s 1s/step - loss: 67.6356
INFO: Training loop in progress
Epoch 256/500
147/147 [==============================] - 189s 1s/step - loss: 65.1828
INFO: Training loop in progress
Epoch 257/500
147/147 [==============================] - 198s 1s/step - loss: 66.8854
INFO: Training loop in progress
Epoch 258/500
147/147 [==============================] - 188s 1s/step - loss: 66.0739
INFO: Training loop in progress
Epoch 259/500
147/147 [==============================] - 207s 1s/step - loss: 70.7531
INFO: Training loop in progress
Epoch 260/500
147/147 [==============================] - 197s 1s/step - loss: 67.0885
INFO: Training loop in progress
Epoch 261/500
147/147 [==============================] - 178s 1s/step - loss: 67.2656
INFO: Training loop in progress
Epoch 262/500
147/147 [==============================] - 197s 1s/step - loss: 66.9330
INFO: Training loop in progress
Epoch 263/500
147/147 [==============================] - 173s 1s/step - loss: 65.8832
INFO: Training loop in progress
Epoch 264/500
147/147 [==============================] - 187s 1s/step - loss: 62.9290
INFO: Training loop in progress
Epoch 265/500
147/147 [==============================] - 191s 1s/step - loss: 62.0031
INFO: Training loop in progress
Epoch 266/500
147/147 [==============================] - 178s 1s/step - loss: 64.7157
INFO: Training loop in progress
Epoch 267/500
147/147 [==============================] - 197s 1s/step - loss: 67.0461
INFO: Training loop in progress
Epoch 268/500
147/147 [==============================] - 202s 1s/step - loss: 60.5556
INFO: Training loop in progress
Epoch 269/500
147/147 [==============================] - 184s 1s/step - loss: 66.5040
INFO: Training loop in progress
Epoch 270/500
147/147 [==============================] - 169s 1s/step - loss: 61.5989
INFO: Training loop in progress
Epoch 271/500
1/147 […] - ETA: 3:49 - loss: 64.7281/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.437024). Check your callbacks.
% delta_t_median)
147/147 [==============================] - 170s 1s/step - loss: 60.4126
INFO: Training loop in progress
Epoch 272/500
147/147 [==============================] - 176s 1s/step - loss: 58.0459
INFO: Training loop in progress
Epoch 273/500
147/147 [==============================] - 168s 1s/step - loss: 66.4481
INFO: Training loop in progress
Epoch 274/500
147/147 [==============================] - 170s 1s/step - loss: 62.8000
INFO: Training loop in progress
Epoch 275/500
147/147 [==============================] - 184s 1s/step - loss: 60.6755
Producing predictions: 100%|████████████████████| 59/59 [00:27<00:00, 2.12it/s]
Start to calculate AP for each class


bus AP 0.66084
car AP 0.38857
motorcycle AP 0.15224
person AP 0.4014
truck AP 0.42886
van AP 0.73351
mAP 0.4609


Validation loss: 37.082874540555274

Epoch 00275: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_275.tlt
INFO: Training loop in progress
Epoch 276/500
147/147 [==============================] - 168s 1s/step - loss: 60.5292
INFO: Training loop in progress
Epoch 277/500
147/147 [==============================] - 178s 1s/step - loss: 58.5089
INFO: Training loop in progress
Epoch 278/500
147/147 [==============================] - 159s 1s/step - loss: 60.8334
INFO: Training loop in progress
Epoch 279/500
147/147 [==============================] - 178s 1s/step - loss: 61.1158
INFO: Training loop in progress
Epoch 280/500
147/147 [==============================] - 164s 1s/step - loss: 64.5135
INFO: Training loop in progress
Epoch 281/500
147/147 [==============================] - 169s 1s/step - loss: 62.3210
INFO: Training loop in progress
Epoch 282/500
2/147 […] - ETA: 3:10 - loss: 70.7986/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.613669). Check your callbacks.
% delta_t_median)
3/147 […] - ETA: 3:04 - loss: 67.1386/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.588989). Check your callbacks.
% delta_t_median)
147/147 [==============================] - 170s 1s/step - loss: 65.8137
INFO: Training loop in progress
Epoch 283/500
147/147 [==============================] - 161s 1s/step - loss: 65.4904
INFO: Training loop in progress
Epoch 284/500
147/147 [==============================] - 174s 1s/step - loss: 61.6379
INFO: Training loop in progress
Epoch 285/500
147/147 [==============================] - 173s 1s/step - loss: 61.6149
INFO: Training loop in progress
Epoch 286/500
147/147 [==============================] - 174s 1s/step - loss: 61.4671
INFO: Training loop in progress
Epoch 287/500
147/147 [==============================] - 171s 1s/step - loss: 61.2316
INFO: Training loop in progress
Epoch 288/500
147/147 [==============================] - 176s 1s/step - loss: 64.7094
INFO: Training loop in progress
Epoch 289/500
147/147 [==============================] - 182s 1s/step - loss: 63.1376
INFO: Training loop in progress
Epoch 290/500
147/147 [==============================] - 175s 1s/step - loss: 59.9275
INFO: Training loop in progress
Epoch 291/500
147/147 [==============================] - 172s 1s/step - loss: 62.1484
INFO: Training loop in progress
Epoch 292/500
147/147 [==============================] - 181s 1s/step - loss: 64.2922
INFO: Training loop in progress
Epoch 293/500
147/147 [==============================] - 178s 1s/step - loss: 60.2746
INFO: Training loop in progress
Epoch 294/500
147/147 [==============================] - 184s 1s/step - loss: 64.8252
INFO: Training loop in progress
Epoch 295/500
147/147 [==============================] - 173s 1s/step - loss: 63.1392
INFO: Training loop in progress
Epoch 296/500
147/147 [==============================] - 177s 1s/step - loss: 63.4977
INFO: Training loop in progress
Epoch 297/500
147/147 [==============================] - 175s 1s/step - loss: 63.2056
INFO: Training loop in progress
Epoch 298/500
147/147 [==============================] - 184s 1s/step - loss: 57.6764
INFO: Training loop in progress
Epoch 299/500
147/147 [==============================] - 167s 1s/step - loss: 70.7185
INFO: Training loop in progress
Epoch 300/500
147/147 [==============================] - 183s 1s/step - loss: 64.6224
Producing predictions: 100%|████████████████████| 59/59 [00:40<00:00, 1.46it/s]
Start to calculate AP for each class


bus AP 0.74199
car AP 0.38829
motorcycle AP 0.04584
person AP 0.39035
truck AP 0.43023
van AP 0.74972
mAP 0.45774


Validation loss: 35.68125428991803

Epoch 00300: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_300.tlt
INFO: Training loop in progress
Epoch 301/500
147/147 [==============================] - 165s 1s/step - loss: 61.7941
INFO: Training loop in progress
Epoch 302/500
147/147 [==============================] - 167s 1s/step - loss: 64.3165
INFO: Training loop in progress
Epoch 303/500
147/147 [==============================] - 158s 1s/step - loss: 57.0432
INFO: Training loop in progress
Epoch 304/500
147/147 [==============================] - 171s 1s/step - loss: 61.1457
INFO: Training loop in progress
Epoch 305/500
147/147 [==============================] - 159s 1s/step - loss: 63.8288
INFO: Training loop in progress
Epoch 306/500
147/147 [==============================] - 167s 1s/step - loss: 56.2817
INFO: Training loop in progress
Epoch 307/500
147/147 [==============================] - 166s 1s/step - loss: 58.7282
INFO: Training loop in progress
Epoch 308/500
147/147 [==============================] - 170s 1s/step - loss: 63.7764
INFO: Training loop in progress
Epoch 309/500
147/147 [==============================] - 168s 1s/step - loss: 65.0983
INFO: Training loop in progress
Epoch 310/500
147/147 [==============================] - 169s 1s/step - loss: 61.8188
INFO: Training loop in progress
Epoch 311/500
147/147 [==============================] - 175s 1s/step - loss: 60.9837
INFO: Training loop in progress
Epoch 312/500
147/147 [==============================] - 164s 1s/step - loss: 58.7195
INFO: Training loop in progress
Epoch 313/500
147/147 [==============================] - 167s 1s/step - loss: 55.3707
INFO: Training loop in progress
Epoch 314/500
147/147 [==============================] - 169s 1s/step - loss: 59.4272
INFO: Training loop in progress
Epoch 315/500
147/147 [==============================] - 171s 1s/step - loss: 58.3997
INFO: Training loop in progress
Epoch 316/500
147/147 [==============================] - 172s 1s/step - loss: 60.6537
INFO: Training loop in progress
Epoch 317/500
147/147 [==============================] - 184s 1s/step - loss: 60.9321
INFO: Training loop in progress
Epoch 318/500
147/147 [==============================] - 182s 1s/step - loss: 62.5915
INFO: Training loop in progress
Epoch 319/500
147/147 [==============================] - 179s 1s/step - loss: 60.4852
INFO: Training loop in progress
Epoch 320/500
147/147 [==============================] - 182s 1s/step - loss: 60.6551
INFO: Training loop in progress
Epoch 321/500
147/147 [==============================] - 174s 1s/step - loss: 60.3092
INFO: Training loop in progress
Epoch 322/500
147/147 [==============================] - 177s 1s/step - loss: 63.3549
INFO: Training loop in progress
Epoch 323/500
147/147 [==============================] - 178s 1s/step - loss: 57.7312
INFO: Training loop in progress
Epoch 324/500
147/147 [==============================] - 178s 1s/step - loss: 60.7344
INFO: Training loop in progress
Epoch 325/500
147/147 [==============================] - 177s 1s/step - loss: 60.8100
Producing predictions: 100%|████████████████████| 59/59 [00:39<00:00, 1.50it/s]
Start to calculate AP for each class


bus AP 0.79575
car AP 0.47595
motorcycle AP 0.15152
person AP 0.41054
truck AP 0.46372
van AP 0.75958
mAP 0.50951


Validation loss: 35.24135660721084

Epoch 00325: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_325.tlt
INFO: Training loop in progress
Epoch 326/500
147/147 [==============================] - 160s 1s/step - loss: 56.5928
INFO: Training loop in progress
Epoch 327/500
147/147 [==============================] - 159s 1s/step - loss: 58.1513
INFO: Training loop in progress
Epoch 328/500
147/147 [==============================] - 178s 1s/step - loss: 56.9702
INFO: Training loop in progress
Epoch 329/500
147/147 [==============================] - 178s 1s/step - loss: 63.1216
INFO: Training loop in progress
Epoch 330/500
147/147 [==============================] - 183s 1s/step - loss: 58.2748
INFO: Training loop in progress
Epoch 331/500
147/147 [==============================] - 186s 1s/step - loss: 57.1779
INFO: Training loop in progress
Epoch 332/500
147/147 [==============================] - 185s 1s/step - loss: 61.0566
INFO: Training loop in progress
Epoch 333/500
147/147 [==============================] - 180s 1s/step - loss: 63.1566
INFO: Training loop in progress
Epoch 334/500
147/147 [==============================] - 179s 1s/step - loss: 60.0457
INFO: Training loop in progress
Epoch 335/500
147/147 [==============================] - 179s 1s/step - loss: 57.8198
INFO: Training loop in progress
Epoch 336/500
147/147 [==============================] - 178s 1s/step - loss: 59.5335
INFO: Training loop in progress
Epoch 337/500
147/147 [==============================] - 177s 1s/step - loss: 61.8735
INFO: Training loop in progress
Epoch 338/500
147/147 [==============================] - 178s 1s/step - loss: 64.2390
INFO: Training loop in progress
Epoch 339/500
147/147 [==============================] - 175s 1s/step - loss: 58.9039
INFO: Training loop in progress
Epoch 340/500
147/147 [==============================] - 176s 1s/step - loss: 54.2408
INFO: Training loop in progress
Epoch 341/500
147/147 [==============================] - 178s 1s/step - loss: 55.2539
INFO: Training loop in progress
Epoch 342/500
147/147 [==============================] - 175s 1s/step - loss: 59.7809
INFO: Training loop in progress
Epoch 343/500
147/147 [==============================] - 175s 1s/step - loss: 60.1105
INFO: Training loop in progress
Epoch 344/500
147/147 [==============================] - 176s 1s/step - loss: 61.5409
INFO: Training loop in progress
Epoch 345/500
147/147 [==============================] - 179s 1s/step - loss: 59.7987
INFO: Training loop in progress
Epoch 346/500
147/147 [==============================] - 174s 1s/step - loss: 60.1907
INFO: Training loop in progress
Epoch 347/500
147/147 [==============================] - 177s 1s/step - loss: 58.4336
INFO: Training loop in progress
Epoch 348/500
147/147 [==============================] - 177s 1s/step - loss: 58.5590
INFO: Training loop in progress
Epoch 349/500
147/147 [==============================] - 177s 1s/step - loss: 59.0470
INFO: Training loop in progress
Epoch 350/500
147/147 [==============================] - 175s 1s/step - loss: 62.5682
Producing predictions: 100%|████████████████████| 59/59 [00:40<00:00, 1.46it/s]
Start to calculate AP for each class


bus AP 0.71681
car AP 0.48087
motorcycle AP 0.22078
person AP 0.43571
truck AP 0.45203
van AP 0.77525
mAP 0.51357


Validation loss: 34.56129410307286

Epoch 00350: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_350.tlt
INFO: Training loop in progress
Epoch 351/500
147/147 [==============================] - 163s 1s/step - loss: 57.4099
INFO: Training loop in progress
Epoch 352/500
147/147 [==============================] - 161s 1s/step - loss: 56.3537
INFO: Training loop in progress
Epoch 353/500
147/147 [==============================] - 181s 1s/step - loss: 56.1302
INFO: Training loop in progress
Epoch 354/500
147/147 [==============================] - 180s 1s/step - loss: 59.2711
INFO: Training loop in progress
Epoch 355/500
147/147 [==============================] - 179s 1s/step - loss: 58.0873
INFO: Training loop in progress
Epoch 356/500
147/147 [==============================] - 177s 1s/step - loss: 59.4608
INFO: Training loop in progress
Epoch 357/500
147/147 [==============================] - 179s 1s/step - loss: 65.7877
INFO: Training loop in progress
Epoch 358/500
147/147 [==============================] - 181s 1s/step - loss: 58.9386
INFO: Training loop in progress
Epoch 359/500
147/147 [==============================] - 176s 1s/step - loss: 56.5022
INFO: Training loop in progress
Epoch 360/500
147/147 [==============================] - 180s 1s/step - loss: 55.3920
INFO: Training loop in progress
Epoch 361/500
147/147 [==============================] - 179s 1s/step - loss: 56.3562
INFO: Training loop in progress
Epoch 362/500
147/147 [==============================] - 179s 1s/step - loss: 53.3507
INFO: Training loop in progress
Epoch 363/500
147/147 [==============================] - 184s 1s/step - loss: 55.3905
INFO: Training loop in progress
Epoch 364/500
147/147 [==============================] - 173s 1s/step - loss: 59.1503
INFO: Training loop in progress
Epoch 365/500
147/147 [==============================] - 180s 1s/step - loss: 60.5284
INFO: Training loop in progress
Epoch 366/500
147/147 [==============================] - 177s 1s/step - loss: 57.1750
INFO: Training loop in progress
Epoch 367/500
147/147 [==============================] - 179s 1s/step - loss: 55.3621
INFO: Training loop in progress
Epoch 368/500
147/147 [==============================] - 179s 1s/step - loss: 55.6117
INFO: Training loop in progress
Epoch 369/500
147/147 [==============================] - 178s 1s/step - loss: 57.3169
INFO: Training loop in progress
Epoch 370/500
147/147 [==============================] - 180s 1s/step - loss: 52.6903
INFO: Training loop in progress
Epoch 371/500
147/147 [==============================] - 180s 1s/step - loss: 56.2942
INFO: Training loop in progress
Epoch 372/500
147/147 [==============================] - 179s 1s/step - loss: 60.1888
INFO: Training loop in progress
Epoch 373/500
147/147 [==============================] - 179s 1s/step - loss: 58.3556
INFO: Training loop in progress
Epoch 374/500
147/147 [==============================] - 182s 1s/step - loss: 58.4871
INFO: Training loop in progress
Epoch 375/500
147/147 [==============================] - 186s 1s/step - loss: 55.7557
Producing predictions: 100%|████████████████████| 59/59 [00:42<00:00, 1.39it/s]
Start to calculate AP for each class


bus AP 0.69201
car AP 0.4888
motorcycle AP 0.12121
person AP 0.43793
truck AP 0.50101
van AP 0.76325
mAP 0.5007


Validation loss: 34.62910021765757

Epoch 00375: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_375.tlt
INFO: Training loop in progress
Epoch 376/500
147/147 [==============================] - 165s 1s/step - loss: 57.6895
INFO: Training loop in progress
Epoch 377/500
147/147 [==============================] - 162s 1s/step - loss: 59.8198
INFO: Training loop in progress
Epoch 378/500
147/147 [==============================] - 183s 1s/step - loss: 58.1805
INFO: Training loop in progress
Epoch 379/500
147/147 [==============================] - 182s 1s/step - loss: 58.1707
INFO: Training loop in progress
Epoch 380/500
147/147 [==============================] - 185s 1s/step - loss: 59.8787
INFO: Training loop in progress
Epoch 381/500
147/147 [==============================] - 186s 1s/step - loss: 60.9108
INFO: Training loop in progress
Epoch 382/500
147/147 [==============================] - 188s 1s/step - loss: 57.9279
INFO: Training loop in progress
Epoch 383/500
147/147 [==============================] - 187s 1s/step - loss: 55.1692
INFO: Training loop in progress
Epoch 384/500
147/147 [==============================] - 187s 1s/step - loss: 60.0766
INFO: Training loop in progress
Epoch 385/500
147/147 [==============================] - 188s 1s/step - loss: 55.3396
INFO: Training loop in progress
Epoch 386/500
147/147 [==============================] - 187s 1s/step - loss: 58.4062
INFO: Training loop in progress
Epoch 387/500
147/147 [==============================] - 187s 1s/step - loss: 55.1818
INFO: Training loop in progress
Epoch 388/500
147/147 [==============================] - 186s 1s/step - loss: 55.9611
INFO: Training loop in progress
Epoch 389/500
147/147 [==============================] - 185s 1s/step - loss: 57.2047
INFO: Training loop in progress
Epoch 390/500
147/147 [==============================] - 179s 1s/step - loss: 54.2458
INFO: Training loop in progress
Epoch 391/500
147/147 [==============================] - 178s 1s/step - loss: 57.1939
INFO: Training loop in progress
Epoch 392/500
147/147 [==============================] - 177s 1s/step - loss: 54.1291
INFO: Training loop in progress
Epoch 393/500
147/147 [==============================] - 177s 1s/step - loss: 54.7330
INFO: Training loop in progress
Epoch 394/500
147/147 [==============================] - 177s 1s/step - loss: 54.8451
INFO: Training loop in progress
Epoch 395/500
147/147 [==============================] - 176s 1s/step - loss: 55.5947
INFO: Training loop in progress
Epoch 396/500
147/147 [==============================] - 176s 1s/step - loss: 56.2735
INFO: Training loop in progress
Epoch 397/500
147/147 [==============================] - 177s 1s/step - loss: 53.7573
INFO: Training loop in progress
Epoch 398/500
147/147 [==============================] - 176s 1s/step - loss: 55.1412
INFO: Training loop in progress
Epoch 399/500
147/147 [==============================] - 176s 1s/step - loss: 55.1581
INFO: Training loop in progress
Epoch 400/500
147/147 [==============================] - 177s 1s/step - loss: 59.1672
Producing predictions: 100%|████████████████████| 59/59 [00:39<00:00, 1.48it/s]
Start to calculate AP for each class


bus AP 0.73362
car AP 0.46367
motorcycle AP 0.12727
person AP 0.40654
truck AP 0.5142
van AP 0.77238
mAP 0.50295


Validation loss: 34.95020383091296

Epoch 00400: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_400.tlt
INFO: Training loop in progress
Epoch 401/500
147/147 [==============================] - 157s 1s/step - loss: 54.0874
INFO: Training loop in progress
Epoch 402/500
147/147 [==============================] - 160s 1s/step - loss: 62.3031
INFO: Training loop in progress
Epoch 403/500
147/147 [==============================] - 178s 1s/step - loss: 55.4003
INFO: Training loop in progress
Epoch 404/500
147/147 [==============================] - 177s 1s/step - loss: 55.4859
INFO: Training loop in progress
Epoch 405/500
147/147 [==============================] - 177s 1s/step - loss: 55.1617
INFO: Training loop in progress
Epoch 406/500
147/147 [==============================] - 176s 1s/step - loss: 53.7364
INFO: Training loop in progress
Epoch 407/500
147/147 [==============================] - 177s 1s/step - loss: 58.6183
INFO: Training loop in progress
Epoch 408/500
147/147 [==============================] - 177s 1s/step - loss: 55.2297
INFO: Training loop in progress
Epoch 409/500
147/147 [==============================] - 176s 1s/step - loss: 57.6367
INFO: Training loop in progress
Epoch 410/500
147/147 [==============================] - 177s 1s/step - loss: 57.8702
INFO: Training loop in progress
Epoch 411/500
147/147 [==============================] - 176s 1s/step - loss: 60.3144
INFO: Training loop in progress
Epoch 412/500
147/147 [==============================] - 176s 1s/step - loss: 58.2854
INFO: Training loop in progress
Epoch 413/500
147/147 [==============================] - 178s 1s/step - loss: 64.2767
INFO: Training loop in progress
Epoch 414/500
147/147 [==============================] - 177s 1s/step - loss: 56.1486
INFO: Training loop in progress
Epoch 415/500
147/147 [==============================] - 177s 1s/step - loss: 58.6552
INFO: Training loop in progress
Epoch 416/500
147/147 [==============================] - 177s 1s/step - loss: 56.8889
INFO: Training loop in progress
Epoch 417/500
147/147 [==============================] - 181s 1s/step - loss: 56.7213
INFO: Training loop in progress
Epoch 418/500
147/147 [==============================] - 179s 1s/step - loss: 53.3411
INFO: Training loop in progress
Epoch 419/500
147/147 [==============================] - 178s 1s/step - loss: 56.2887
INFO: Training loop in progress
Epoch 420/500
147/147 [==============================] - 178s 1s/step - loss: 60.5685
INFO: Training loop in progress
Epoch 421/500
147/147 [==============================] - 178s 1s/step - loss: 58.8606
INFO: Training loop in progress
Epoch 422/500
147/147 [==============================] - 180s 1s/step - loss: 57.7566
INFO: Training loop in progress
Epoch 423/500
147/147 [==============================] - 181s 1s/step - loss: 59.6809
INFO: Training loop in progress
Epoch 424/500
147/147 [==============================] - 188s 1s/step - loss: 57.6117
INFO: Training loop in progress
Epoch 425/500
147/147 [==============================] - 182s 1s/step - loss: 58.0493
Producing predictions: 100%|████████████████████| 59/59 [00:41<00:00, 1.42it/s]
Start to calculate AP for each class


bus AP 0.73624
car AP 0.49817
motorcycle AP 0.13636
person AP 0.45996
truck AP 0.55561
van AP 0.75204
mAP 0.52306


Validation loss: 33.71268529407049

Epoch 00425: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_425.tlt
INFO: Training loop in progress

You can attach the log using the button below when you reply.

Please set output_width and output_height appropriately for your training images. It is best to match the actual image resolution as closely as possible, but note that the values must also satisfy the requirements in the YOLOv4-tiny — TAO Toolkit 4.0 documentation.
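For a quick sanity check of candidate sizes, here is a minimal sketch. It assumes the documented requirement is that output_width and output_height are both multiples of 32; please confirm that against the documentation for your TAO version. The helper names `is_valid_output_size` and `round_to_stride` are made up for this illustration and are not part of TAO.

```python
def is_valid_output_size(width, height, stride=32):
    """Return True when both dimensions are positive multiples of `stride`.

    The stride of 32 is an assumption about the documented constraint.
    """
    return width > 0 and height > 0 and width % stride == 0 and height % stride == 0


def round_to_stride(value, stride=32):
    """Round `value` to the nearest multiple of `stride` (never below one stride)."""
    return max(stride, int(round(value / stride)) * stride)


# Sizes mentioned in this thread, plus 1920x1080 as a common camera resolution.
candidates = [(1248, 384), (640, 480), (1920, 1088), (1920, 1056), (1920, 1080)]
for w, h in candidates:
    suggestion = (round_to_stride(w), round_to_stride(h))
    print(f"{w}x{h}: valid={is_valid_output_size(w, h)}, nearest={suggestion}")
```

Under that assumption, every size tried in this thread already passes the check, so a crash with those values would have to come from something else.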

Then generate new anchor shapes.
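To illustrate why the anchors have to be regenerated whenever output_width and output_height change, here is a minimal sketch of k-means clustering on box sizes with an IoU-based assignment. It is not the TAO k-means script, only an approximation of the general technique; `scale_boxes`, `iou_wh` and `kmeans_anchors` are hypothetical names, and the box list is made-up data. Box widths and heights must be positive.

```python
import random


def scale_boxes(boxes, src_size, dst_size):
    """Rescale (w, h) boxes from the source image size to the network input size."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return [(w * sx, h * sy) for w, h in boxes]


def iou_wh(box, anchor):
    """IoU of two (w, h) boxes that share the same centre."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=42):
    """Cluster (w, h) boxes into k anchors, assigning each box to its best-IoU anchor."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an anchor that attracted no boxes
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Made-up label sizes in pixels of the original images.
demo_boxes = [(10, 5)] * 5 + [(40, 20)] * 5
demo_anchors = kmeans_anchors(demo_boxes, k=2)
print(demo_anchors)
```

Because the boxes are rescaled to the network input size before clustering, anchors computed for one output_width/output_height pair do not carry over to another.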

Also, your training spec file has only 3 classes, but your training log shows 6 classes. Please check whether the log is correct.

output_width: 1248
output_height: 384

If I change these values to the actual image input size (for example, 640x480 or 1920x1088), I receive the following error:
20:08:40.109369: F tensorflow/stream_executor/cuda/redzone_allocator.cc:287] Check failed: !lhs_check.ok() || !rhs_check.ok() Mismatched results with host and device comparison
[ef9bb8f429eb:00078] *** Process received signal ***
[ef9bb8f429eb:00078] Signal: Aborted (6)
[ef9bb8f429eb:00078] Signal code: (-6)
[ef9bb8f429eb:00078] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f832c345210]
[ef9bb8f429eb:00078] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f832c34518b]
[ef9bb8f429eb:00078] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f832c324859]
[ef9bb8f429eb:00078] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xc1b1788)[0x7f82b07c1788]
[ef9bb8f429eb:00078]

"Also, your training spec file has only 3 classes, but your training log shows 6 classes. Please check whether the log is correct."
Yes, I deleted that part to keep the post short.

What is the average resolution of your training images? Please set it in output_width and output_height, then recalculate the anchor shapes and run training again.
Also, your training log does not match your training spec. The spec sets only 3 classes, so the training log should report AP for only 3 classes.

Hey, thank you for your response.
I changed the values to the image resolution (1920x1056), recalculated the anchors, and ran the training.
During the second epoch I received the following error:

INFO: Starting Training Loop.
Epoch 1/500
  1/147 [..............................] - ETA: 1:27:12 - loss: 18511.1445WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

  2/147 [..............................] - ETA: 54:25 - loss: 18470.6797  /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (3.425583). Check your callbacks.
  % delta_t_median)
147/147 [==============================] - 557s 4s/step - loss: 20875.1052
6ba03ea3d679:61:100 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
6ba03ea3d679:61:100 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
6ba03ea3d679:61:100 [0] NCCL INFO P2P plugin IBext
6ba03ea3d679:61:100 [0] NCCL INFO NET/IB : No device found.
6ba03ea3d679:61:100 [0] NCCL INFO NET/IB : No device found.
6ba03ea3d679:61:100 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
6ba03ea3d679:61:100 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
6ba03ea3d679:61:100 [0] NCCL INFO Channel 00/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 01/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 02/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 03/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 04/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 05/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 06/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 07/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 08/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 09/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 10/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 11/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 12/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 13/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 14/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 15/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 16/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 17/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 18/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 19/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 20/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 21/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 22/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 23/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 24/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 25/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 26/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 27/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 28/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 29/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 30/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Channel 31/32 :    0
6ba03ea3d679:61:100 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
6ba03ea3d679:61:100 [0] NCCL INFO Connected all rings
6ba03ea3d679:61:100 [0] NCCL INFO Connected all trees
6ba03ea3d679:61:100 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
6ba03ea3d679:61:100 [0] NCCL INFO comm 0x7fac58793ae0 rank 0 nranks 1 cudaDev 0 busId 1e0 - Init COMPLETE
INFO: Training loop in progress
Epoch 2/500
120/147 [=======================>......] - ETA: 2:10 - loss: 20298.7283INFO: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 2418 values, but the requested shape has 2379
	 [[{{node bg_anchor_1/Reshape}}]]
	 [[cond_129/SliceReplace_1/range/_9117]]
  (1) Invalid argument: Input to reshape is a tensor with 2418 values, but the requested shape has 2379
	 [[{{node bg_anchor_1/Reshape}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 145, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 141, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 126, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 77, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py", line 692, in train
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 2418 values, but the requested shape has 2379
	 [[{{node bg_anchor_1/Reshape}}]]
	 [[cond_129/SliceReplace_1/range/_9117]]
  (1) Invalid argument: Input to reshape is a tensor with 2418 values, but the requested shape has 2379
	 [[{{node bg_anchor_1/Reshape}}]]
0 successful operations.
0 derived errors ignored.
2023-01-01 16:05:22,231 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
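
One small observation about the two numbers in the reshape error, offered only as a starting point for debugging and not as a diagnosis: a short sketch that factors the two element counts. What the shared factor corresponds to inside the bg_anchor_1 node is not stated in the log, so any interpretation of it is an assumption.

```python
import math

# Element counts copied from the InvalidArgumentError message.
got, expected = 2418, 2379

common = math.gcd(got, expected)
print(f"common factor: {common}")
print(f"got      = {common} x {got // common}")
print(f"expected = {common} x {expected // common}")
print(f"difference = {got - expected}")
```

The counts share the factor 39 and differ by exactly one group of 39 values (62 groups versus 61), which suggests that one dimension of the tensor is off by one for the failing batch.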

To narrow this down, can you train with fewer tfrecords files and try again?
You can use bisection to find out which subset works.
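
The bisection idea can be sketched as follows. This is a minimal illustration under two assumptions: the failure is triggered by a single bad item (a tfrecords shard or one image/label pair), and there is some way to answer "does training run on this subset?". That question is represented by the hypothetical callback `runs_ok`; in practice it would mean rebuilding the tfrecords from the subset and starting a short training run by hand. The file names are made up.

```python
def find_failing_item(items, runs_ok):
    """Bisect `items` down to one element that makes `runs_ok` return False.

    Returns None when the full set already runs fine.
    Assumes a single item is responsible for the failure.
    """
    if runs_ok(items):
        return None
    candidates = list(items)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        # Keep whichever half still reproduces the failure.
        candidates = first if not runs_ok(first) else second
    return candidates[0]


# Demo with a simulated check: pretend one label file breaks training.
files = [f"label_{i:04d}.txt" for i in range(1174)]
culprit = "label_0731.txt"


def simulated_runs_ok(subset):
    return culprit not in subset


found = find_failing_item(files, simulated_runs_ok)
print(found)
```

With 1174 files this takes about 11 trial runs instead of 1174, which is the point of halving the set each time.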

Thanks for the advice, but could you please explain how I can use fewer tfrecords or apply the bisection method?

By the way, the dataset I’m using contains:

Number of images in the train/val set: 1174
Number of labels in the train/val set: 1174
Number of images in the test set: 28

tfrecord_train config:

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/training"
  image_dir_name: "image"
  label_dir_name: "label"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 2
}
image_directory_path: "/workspace/tao-experiments/data/training"

You can back up the tfrecord folder above, and then use part of it for experiments.

Is it recommended to use tfrecords? I don’t see any benefit from them.
Is there a way to avoid using this format? What is your opinion on that?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

The experiment above is meant to find the culprit, that is, which part of the dataset is not working.
Did you try with fewer tfrecords files? Does it run successfully?

For your latest question, you can try the sequence format for your experiments. More info can be found in the YOLOv4-tiny — TAO Toolkit 4.0 documentation.
