Retrain an SSD model

pangqingshuang · January 9, 2020, 2:55am

I want to retrain an SSD model, but I encounter the following error:

Message type "EvalConfig" has no field named "averge_precision_mode".

It points to the following part of the config file (I followed the documentation):

eval_config {
  validation_period_during_training: 10
  averge_precision_mode: SAMPLE
  matching_iou_threshold: 0.5
}

After I deleted “averge_precision_mode: SAMPLE”, this error disappeared.
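
The rejected name looks like a misspelling of "average_precision_mode". As a minimal sketch of how a typo like this can be caught before launching a run, the snippet below compares a field name against a list of expected names and suggests the closest one. The list KNOWN_EVAL_FIELDS and the helper suggest_field are hypothetical and not part of TLT; the spelling "average_precision_mode" is an assumption that should be checked against the documentation for your TLT version.

```python
import difflib

# Hypothetical list of field names expected in eval_config. This is an
# assumption for illustration; confirm the names in the TLT documentation.
KNOWN_EVAL_FIELDS = [
    "validation_period_during_training",
    "average_precision_mode",
    "matching_iou_threshold",
]


def suggest_field(name, known=KNOWN_EVAL_FIELDS):
    """Return the closest known field name, or None if nothing is similar."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
    return matches[0] if matches else None


if __name__ == "__main__":
    # The field name copied from the error message.
    print(suggest_field("averge_precision_mode"))
```

If the suggestion matches what the documentation lists, correcting the spelling in the spec keeps the setting instead of dropping it.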

But then another error appeared:
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 33, in main
  File "./ssd/scripts/train.py", line 301, in main
  File "./ssd/scripts/train.py", line 168, in run_experiment
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 203, in get_dataset_tensors
  File "./detectnet_v2/dataloader/default_dataloader.py", line 229, in _generate_images_and_ground_truth_labels
  File "./modulus/processors/processors.py", line 227, in __call__
  File "./detectnet_v2/dataloader/utilities.py", line 60, in __call__
  File "./modulus/processors/tfrecords_iterator.py", line 143, in process_records
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1508, in split
    axis=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 8883, in split
    "Split", split_dim=axis, value=value, num_split=num_split, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 709, in _apply_op_helper
    (key, op_type_name, attr_value.i, attr_def.minimum))
ValueError: Attr 'num_split' of 'Split' Op passed 0 less than minimum 1.

To me, TLT looks like a black box.
Can anyone help me figure out how to debug this situation?

;)
life is sooooo difficult

Morganh · January 9, 2020, 3:41am

Hi pangqingshuang,
Thanks for using TLT! I have a question: you got the error while retraining SSD. So you did not hit the error during training, and it only failed during retraining?

pangqingshuang · January 10, 2020, 2:59pm

Hi Morganh, I didn't try the training process for SSD.

Morganh · January 10, 2020, 5:14pm

Hi pangqingshuang,
Could you share more details on how you generated the tfrecords? The error may come from the tfrecord files.
Have you tried a run with the Jupyter SSD sample? That may help you better understand the process.
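
One check that does not depend on TLT is whether the tfrecords path pattern in the spec matches any files, and whether those files contain records. The sketch below uses only the Python standard library and walks the TFRecord framing (a length-prefixed record format) to count records per file. The helper names and the example path are hypothetical, and whether an empty or unmatched dataset is what produces num_split = 0 in this case is an assumption, not something the traceback confirms.

```python
import glob
import os
import struct


def count_tfrecord_records(path):
    """Count records in one TFRecord file by walking its framing.

    Each record is: uint64 length, uint32 CRC of the length, `length` bytes
    of data, uint32 CRC of the data. CRCs are skipped here, not verified.
    """
    count = 0
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        while f.tell() < size:
            header = f.read(12)  # 8-byte length + 4-byte length CRC
            if len(header) < 12:
                raise ValueError("truncated record header in %s" % path)
            (length,) = struct.unpack("<Q", header[:8])
            f.seek(length + 4, os.SEEK_CUR)  # skip payload + 4-byte data CRC
            if f.tell() > size:
                raise ValueError("truncated record body in %s" % path)
            count += 1
    return count


def summarize_tfrecords(pattern):
    """Map every file matching `pattern` to its record count."""
    return {p: count_tfrecord_records(p) for p in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    # Hypothetical pattern; use the tfrecords path from your own spec file.
    counts = summarize_tfrecords("/workspace/tfrecords/train*")
    if not counts:
        print("No files matched the pattern.")
    for path, n in counts.items():
        print("%s: %d records" % (path, n))
```

An empty result, or files with zero records, would point at the dataset conversion step or the path in the spec rather than at the training code.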

Morganh · January 10, 2020, 5:33pm

Also, you mentioned that you did not train SSD. But if you did not train, how did you retrain? Could you paste the full command along with the retraining spec? Thanks.

pangqingshuang · January 11, 2020, 6:09am

I just used the pretrained model downloaded from Docker. I will try the suggestion you mentioned above and post feedback soon. Thanks.

kayccc · February 4, 2020, 2:26am

Hi pangqingshuang,

We haven’t heard back from you in a couple of weeks, so we are marking this topic closed.
Please open a new forum issue when you are ready and we’ll pick it up there.