tlt-train always fails with No such file or directory: 'trained/model.step-0.ckzip'

I’m using DetectNet_v2 for human head detection. To isolate the cause of the error, I took 100 images from my dataset and built the tfrecord folds with tlt-dataset-convert. All images are resized to 960x544 to match the network input, and the bounding boxes are adjusted accordingly. When I run tlt-train, graph.pbtxt is produced, but no checkpoints are written and training fails with:

2020-05-06 13:33:46,120 [INFO] iva.detectnet_v2.scripts.train: Found 95 samples in training set
...
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
...
2020-05-06 13:34:05,899 [INFO] iva.detectnet_v2.scripts.train: Found 5 samples in validation set
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 47, in main
  File "<decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 667, in main
  File "./detectnet_v2/scripts/train.py", line 591, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 525, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 142, in run_training_loop
  File "./detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1021, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 650, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 812, in create_session
    hook.after_create_session(self.tf_sess, self.coord)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 568, in after_create_session
    self._save(session, global_step)
  File "./detectnet_v2/tfhooks/checkpoint_saver_hook.py", line 77, in _save
  File "./detectnet_v2/tfhooks/checkpoint_saver_hook.py", line 110, in _save_encrypted_checkpoint
IOError: [Errno 2] No such file or directory: 'trained/model.step-0.ckzip'

Shouldn’t the training process create these checkpoints itself? In training_config, checkpoint_interval is set to 10. I’m running in Docker with a volume mounted from the host PC, in case that matters. I launch training with:

root@351b316abd94:/usr/src/headcount# tlt-train detectnet_v2 -e detectnet_v2_train_resnet18_kitti.txt -r trained -k tlt_encode -n resnet18_detector --gpus 1

Spec file:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/usr/src/headcount/tfrecords/*"
    image_directory_path: "/usr/src/headcount/tinyset"
  }
  image_extension: "jpeg"
  target_class_mapping {
    key: "head"
    value: "head"
  }
  validation_fold: 0
}
model_config {
  arch: "resnet"
  pretrained_model_file: "/usr/src/headcount/pretrained/tlt_pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
  freeze_blocks: 0
  freeze_blocks: 1
  all_projections: True
  num_layers: 18
  use_pooling: False
  use_batch_norm: True
  dropout_rate: 0.0
  training_precision {
    backend_floatx: FLOAT32
  }
  objective_set {
    cov { }
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
}
bbox_rasterizer_config {
  target_class_config {
    key: "head"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
postprocessing_config {
  target_class_config {
    key: "head"
    value {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}
cost_function_config {
  target_classes {
    name: "head"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  min_objective_weight: 0.0001
  max_objective_weight: 0.9999
}
training_config {
  batch_size_per_gpu: 26
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 1
}
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
}
evaluation_config {
  average_precision_mode: INTEGRATE
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "head"
    value: 0.5
  }
  evaluation_box_config {
    key: "head"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

I couldn’t find anything related to this error. What is the problem here? Any help is appreciated.

Please set an absolute path for the results folder (the -r argument) instead of the relative path "trained".
Reference: CostFunctionConfig should have at least one class - #8 by Morganh
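
As a rough sketch of that change, assuming the shell is sitting in /usr/src/headcount inside the container (as the prompt in the original command suggests) and that the results should still go into a folder named trained there, the launch could look like the following. The RESULTS_DIR variable is only an illustrative name, and creating the folder beforehand is a precaution rather than a documented requirement.

```shell
# Assumption: the current directory is /usr/src/headcount inside the container,
# so "$(pwd)/trained" expands to /usr/src/headcount/trained.
RESULTS_DIR="$(pwd)/trained"

# Creating the folder up front is a precaution, not something the reply requires.
mkdir -p "$RESULTS_DIR"

# Same command as in the question; only the -r value has changed.
# The guard keeps the snippet harmless on a machine without TLT installed.
if command -v tlt-train >/dev/null 2>&1; then
  tlt-train detectnet_v2 -e detectnet_v2_train_resnet18_kitti.txt \
    -r "$RESULTS_DIR" -k tlt_encode -n resnet18_detector --gpus 1
fi
```

If the folder lives somewhere else on the mounted volume, writing that full path after -r directly should work the same way.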

My search obviously wasn’t thorough enough. Thank you!