TAO yoloV4 cannot train from checkpoint

• Network Type Yolo_v4
• TLT Version: toolkit_version: 3.22.02
published_date: 02/28/2022

nvidia/tao/tao-toolkit-tf:
v3.21.11-tf1.15.5-py3:

• Training spec file

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(256.60, 62.21), (318.49, 173.14), (801.37, 278.01)]"
  mid_anchor_shape: "[(59.92, 52.12), (122.15, 30.62), (123.42, 104.64)]"
  small_anchor_shape: "[(16.30, 9.87), (29.25, 26.46), (59.63, 14.60)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 10
  enable_qat: false
  checkpoint_interval: 3
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  resume_model_path: "/workspace/tao-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_010.tlt"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure:1.5
  vertical_flip:0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  output_channel: 3
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio:0.2
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
      image_directory_path: "/workspace/tao-experiments/data/training"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "mouse"
    value: "mouse"
  }
  target_class_mapping {
    key: "keyboard"
    value: "keyboard"
  }
  target_class_mapping {
    key: "person"
    value: "person"
  }
  target_class_mapping {
    key: "bicycle"
    value: "bicycle"
  }
  target_class_mapping {
    key: "spoon"
    value: "spoon"
  }
  target_class_mapping {
    key: "bowl"
    value: "bowl"
  }
  target_class_mapping {
    key: "cat"
    value: "cat"
  }
  target_class_mapping {
    key: "bed"
    value: "bed"
  }
  target_class_mapping {
    key: "toilet"
    value: "toilet"
  }
  target_class_mapping {
    key: "motorcycle"
    value: "motorcycle"
  }
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "bottle"
    value: "bottle"
  }
  target_class_mapping {
    key: "sink"
    value: "sink"
  }
  target_class_mapping {
    key: "pizza"
    value: "pizza"
  }
  target_class_mapping {
    key: "microwave"
    value: "microwave"
  }
  target_class_mapping {
    key: "chair"
    value: "chair"
  }
  target_class_mapping {
    key: "fork"
    value: "fork"
  }
  target_class_mapping {
    key: "truck"
    value: "truck"
  }
  target_class_mapping {
    key: "dog"
    value: "dog"
  }
  target_class_mapping {
    key: "refrigerator"
    value: "refrigerator"
  }
  target_class_mapping {
    key: "oven"
    value: "oven"
  }
  target_class_mapping {
    key: "knife"
    value: "knife"
  }
  target_class_mapping {
    key: "cell_phone"
    value: "cell_phone"
  }
  target_class_mapping {
    key: "laptop"
    value: "laptop"
  }
  target_class_mapping {
    key: "book"
    value: "book"
  }
  target_class_mapping {
    key: "handbag"
    value: "handbag"
  }
  target_class_mapping {
    key: "traffic_light"
    value: "traffic_light"
  }
  target_class_mapping {
    key: "backpack"
    value: "backpack"
  }
  target_class_mapping {
    key: "scissors"
    value: "scissors"
  }
  target_class_mapping {
    key: "bus"
    value: "bus"
  }
  target_class_mapping {
    key: "bench"
    value: "bench"
  }
  target_class_mapping {
    key: "couch"
    value: "couch"
  }
  target_class_mapping {
    key: "remote"
    value: "remote"
  }
  target_class_mapping {
    key: "sandwich"
    value: "sandwich"
  }
  target_class_mapping {
    key: "stop_sign"
    value: "stop_sign"
  }
  target_class_mapping {
    key: "toaster"
    value: "toaster"
  }
  validation_data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
      image_directory_path: "/workspace/tao-experiments/data/val"
  }
}

the erorr is

Total params: 20,304,305
Trainable params: 20,282,161
Non-trainable params: 22,144
__________________________________________________________________________________________________
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:7: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

2022-08-04 14:19:14,663 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:7: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:8: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

2022-08-04 14:19:14,664 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:8: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

2022-08-04 14:19:14,665 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

2022-08-04 14:19:36,582 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you share the steps how did you run?

Please share the command line.

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_v \
                   -k $KEY \
                   --gpus 1

Please try to use below way to debug.
Open a terminal.

$ tao yolo_v4 run /bin/bash

then, inside the docker, run again.
# yolo_v4 train xxx

Apparently from terminal it works, why?..

In notebook, mask sure all the env is setup.
For example, the $KEY is available.

You can try again in notebook with explicit command.

All are set, because…with pretrained_model_path it runs…with resume_model_path it doenst run,

It does not make sense. You can double check.