KeyError: 'customer' when running tlt-train

Hi,

Firstly, thank you for the TLT general release! For those of us (myself included) who couldn’t make it into the Early Access Program, the wait is finally over and I am really excited!

After downloading and installing the TLT Docker image and preparing my dataset as per the instructions in the Getting Started guide, I ran a training cycle using the “tlt_resnet18_detectnet_v2_v1” model.

I am fine-tuning the resnet18 model on a single class called ‘customer’.

However, at the end of the first evaluation cycle I get this error:

2019-09-26 21:22:13,642 [INFO] tensorflow: global_step/sec: 2.55173
INFO:tensorflow:epoch = 0.9527027027027027, loss = 0.008371737, step = 141 (5.488 sec)
2019-09-26 21:22:14,041 [INFO] tensorflow: epoch = 0.9527027027027027, loss = 0.008371737, step = 141 (5.488 sec)
2019-09-26 21:22:16,915 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 36, 0.00s/step
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/evaluation/metadata.py:38: UserWarning: One or more metadata field(s) are missing from ground_truth batch_data, and will be replaced with defaults: ['frame/camera_location']
2019-09-26 21:22:28,510 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 36, 1.16s/step
2019-09-26 21:22:36,094 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 36, 0.76s/step
2019-09-26 21:22:43,573 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 36, 0.75s/step
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 632, in main
  File "./detectnet_v2/scripts/train.py", line 556, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 490, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 136, in run_training_loop
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
    run_metadata=run_metadata))
  File "./detectnet_v2/tfhooks/validation_hook.py", line 69, in after_run
  File "./detectnet_v2/tfhooks/validation_hook.py", line 75, in validate
  File "./detectnet_v2/evaluation/evaluation.py", line 179, in evaluate
  File "./detectnet_v2/evaluation/compute_metrics.py", line 161, in __init__
  File "./detectnet_v2/evaluation/compute_metrics.py", line 343, in _prepare_internal_structures
  File "./detectnet_v2/evaluation/compute_metrics.py", line 301, in _check_if_bbox_is_valid
KeyError: 'customer'

My training command is:

tlt-train detectnet_v2 --gpus 1 \
                       -r /workspace/nvidia_experiment/training_output \
                       -e /workspace/nvidia_experiment/training.config \
                       -n nvidia_experiment_1 \
                       -k $MY_API_KEY

And my training config is:

dataset_config {
 data_sources: {
   tfrecords_path: "/workspace/nvidia_experiment/dataset/tfrecords/*"
   image_directory_path: "/workspace/nvidia_experiment/dataset/"
 }
 image_extension: "jpeg"
 target_class_mapping {
   key: "customer"
   value: "customer"
 }
 validation_fold: 0
}

augmentation_config {
 preprocessing {
   output_image_width: 640
   output_image_height: 480
   output_image_channel: 3
   min_bbox_width: 1.0
   min_bbox_height: 1.0
 }
 spatial_augmentation {
   hflip_probability: 0.5
   zoom_min: 1.0
   zoom_max: 1.0
   translate_max_x: 8.0
   translate_max_y: 8.0
 }
 color_augmentation {
   hue_rotation_max: 25.0
   saturation_shift_max: 0.2
   contrast_scale_max: 0.1
   contrast_center: 0.5
 }
}

model_config {
  # Model architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet']
  arch: "resnet"  
  pretrained_model_file: "/workspace/nvidia_experiment/tlt_resnet18_detectnet_v2_v1/resnet18.hdf5"  
  # we are freezing the first two conv blocks to retain features learnt in pretraining
  freeze_blocks: 0
  freeze_blocks: 1  
  all_projections: True
  # for resnet --> n_layers can be [10, 18, 50]
  # for vgg --> n_layers can be [16, 19]
  num_layers: 18
  #use_bias: True
  use_pooling: False
  use_batch_norm: True
  dropout_rate: 0.0
  freeze_bn: False
  training_precision: {
    backend_floatx: FLOAT32
  }
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
}

training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}

evaluation_config {
 average_precision_mode: INTEGRATE
 validation_period_during_training: 10
 first_validation_epoch: 1
 minimum_detection_ground_truth_overlap {
   key: "customer"
   #value is the IoU value
   value: 0.5
 }
}

bbox_rasterizer_config {
 target_class_config {
   key: "customer"
   value {
     cov_center_x: 0.5
     cov_center_y: 0.5
     cov_radius_x: 0.4
     cov_radius_y: 0.4
     bbox_min_radius: 1.0
   }
 }
 deadzone_radius: 0.67
}

postprocessing_config {
 target_class_config {
   key: "customer"
   value {
     clustering_config {
       coverage_threshold: 0.005
       dbscan_eps: 0.13
       dbscan_min_samples: 0.05
       minimum_bounding_box_height: 4
     }
   }
 }
}

cost_function_config {
 target_classes {
   name: "customer"
   class_weight: 1.0
   coverage_foreground_weight: 0.05
   objectives {
     name: "cov"
     initial_weight: 1.0
     weight_target: 1.0
   }
   objectives {
     name: "bbox"
     initial_weight: 10.0
     weight_target: 10.0
   }
 }
 enable_autoweighting: True
 max_objective_weight: 0.9999
 min_objective_weight: 0.0001
}

I am unable to figure out where this KeyError is coming from. Please help!

Hi pushkar,
Your evaluation_config is as below:

evaluation_config {
 average_precision_mode: INTEGRATE
 validation_period_during_training: 10
 first_validation_epoch: 1
 minimum_detection_ground_truth_overlap {
   key: "customer"
   #value is the IoU value
   value: 0.5
 }
}

It is missing “evaluation_box_config”. This nested configuration field sets the minimum and maximum box dimensions for a ground truth or prediction to be considered valid in the AP calculation.
So, please add it to your evaluation_config, like this:

evaluation_config {
 average_precision_mode: INTEGRATE
 validation_period_during_training: 10
 first_validation_epoch: 1
 minimum_detection_ground_truth_overlap {
   key: "customer"
   #value is the IoU value
   value: 0.5
 }
 evaluation_box_config {
   key: "customer"
   value {
     minimum_height: 4
     maximum_height: 9999
     minimum_width: 4
     maximum_width: 9999
   }
 }
}
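
For context, the KeyError itself is just a missing per-class lookup: the traceback ends in _check_if_bbox_is_valid, which presumably indexes the per-class box-validity limits by class name, so any class without an evaluation_box_config entry fails the lookup. A minimal Python sketch of the mechanism (illustrative only, not the actual TLT source; the function and field names are assumptions):

# Illustrative sketch, not the actual TLT code: per-class box limits are
# assumed to live in a dict keyed by class name, built from
# evaluation_box_config. With no entry for "customer", the lookup raises
# KeyError: 'customer', matching the traceback above.
evaluation_box_limits = {}  # empty before the fix; no "customer" entry

def check_if_bbox_is_valid(class_name, height, width):
    limits = evaluation_box_limits[class_name]  # raises KeyError: 'customer'
    return (limits["minimum_height"] <= height <= limits["maximum_height"] and
            limits["minimum_width"] <= width <= limits["maximum_width"])

try:
    check_if_bbox_is_valid("customer", 120.0, 60.0)
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 'customer'

Once the evaluation_box_config entry above is added, that per-class lookup succeeds and evaluation can run to completion.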

Hi Morgan,

Thanks for your help. tlt-train is now running perfectly.