I am training FRCNN in TLT with Resnet18 and Mobilenet_v2 backbones.
I listed the available models using ngc registry model list nvidia/iva/tlt_*
I then downloaded Resnet18 and Mobilenet_v2 using the following commands.
ngc registry model download-version nvidia/iva/tlt_resnet18_faster_rcnn:1
ngc registry model download-version nvidia/iva/tlt_mobilenet_v2_faster_rcnn:1
Both failed in training with different issues.
For Mobilenet_v2, training failed with
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 30, in main
File "./faster_rcnn/scripts/train.py", line 273, in main
File "./faster_rcnn/data_loader/loader.py", line 200, in kitti_data_gen
UnboundLocalError: local variable 'image_channel_order' referenced before assignment
Resnet18 failed with
2020-04-07 03:44:08,525 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Loading pretrained weights from /workspace/tlt_resnet18_faster_rcnn_v1/resnet18.h5
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 30, in main
File "./faster_rcnn/scripts/train.py", line 232, in main
File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 1163, in load_weights
reshape=reshape)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 1130, in load_weights_from_hdf5_group_by_name
' element(s).')
ValueError: Layer #4 (named "block_1a_conv_1") expects 1 weight(s), but the saved weights have 2 element(s).
My TLT container is the latest one: nvcr.io/nvidia/tlt-streamanalytics:v1.0.1_py2
How can I fix these issues?
Spec files for Resnet and Mobilenet are as follows.
Sorry Sir.
The spec files for Resnet and Mobilenet are attached: specs_frcnn.log (3.5 KB) and specs_mobilenet_v2.log (3.4 KB).
I changed the extensions to .log because otherwise the files could not be submitted.
My training image size is 736 x 736 (a multiple of 32).
The “feature_extractor” field should match your backbone. In your specs_mobilenet_v2.log it is set incorrectly.
For MobileNet V1/V2, to load the pretrained weights from NGC for training or retraining, set the “conv_bn_share_bias” field in the experiment_spec file to “True”. For all other backbones, set it to “False” when loading the pretrained weights from NGC.
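The rule above can be written down as a small check to run against the values in a spec file before training. This is only a sketch: the helper names are hypothetical, it does not parse the spec file itself, and the backbone strings other than "resnet:18" and "mobilenet_v2" are assumptions about how the field is spelled.

```python
def expected_conv_bn_share_bias(feature_extractor):
    """Value of conv_bn_share_bias that the rule above prescribes
    when loading NGC pretrained weights (hypothetical helper)."""
    # MobileNet V1/V2 -> True, every other backbone -> False.
    return feature_extractor.lower().startswith("mobilenet")


def check_spec(feature_extractor, conv_bn_share_bias):
    """Return None if the pair is consistent, else a short message."""
    expected = expected_conv_bn_share_bias(feature_extractor)
    if conv_bn_share_bias != expected:
        return "conv_bn_share_bias should be %s for backbone %s" % (
            expected, feature_extractor)
    return None


# Values as they might be copied by hand from the two spec files.
for backbone, share_bias in [("resnet:18", False), ("mobilenet_v2", True)]:
    problem = check_spec(backbone, share_bias)
    print(backbone, "OK" if problem is None else problem)
```

A mismatch here is a likely cause of errors such as "expects 1 weight(s), but the saved weights have 2 element(s)", since the flag plausibly changes whether each convolution layer carries its own bias in addition to its kernel.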
Thanks, I have updated the spec files as you suggested.
Now both resnet:18 and mobilenet_v2 fail with the same error, shown below.
2020-04-08 05:10:17,659 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Loading pretrained weights from /workspace/tlt-experiments/FasterRCNN_18/resnet18.h5
2020-04-08 05:10:19,319 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Pretrained weights loaded!
2020-04-08 05:10:19,515 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: training example num: 4579
2020-04-08 05:10:19,657 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Starting training
2020-04-08 05:10:19,657 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Epoch 1/7
Found 4579 examples in training dataset, valid image extension isjpg, jpeg and png(case sensitive)
Compressed_class_mapping: {u'plate': 0, u'background': 2, u'textline': 1}
Name mapping:{u'plate': u'plate', u'background': u'background', u'textline': u'textline'}
Training dataset stats(compressed via class mapping):
{u'plate': 5164, u'background': 0, u'textline': 6586}
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 30, in main
File "./faster_rcnn/scripts/train.py", line 273, in main
File "./faster_rcnn/data_loader/loader.py", line 200, in kitti_data_gen
UnboundLocalError: local variable 'image_channel_order' referenced before assignment
This error, UnboundLocalError: local variable 'image_channel_order' referenced before assignment, does not depend on whether the channel order is rgb or bgr.
It occurs because the source code reads the variable on a code path where it was never assigned a value.
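For reference, here is a minimal sketch of how this kind of error arises in Python. The TLT loader source is not shown in this thread, so the function names, the accepted values and the branch structure below are all hypothetical. They only illustrate the general pattern of a local variable that is assigned in some branches and read unconditionally.

```python
def channel_order_buggy(image_type):
    """Hypothetical stand-in, not the TLT source."""
    if image_type == "rgb":
        image_channel_order = "bgr"
    elif image_type == "grayscale":
        image_channel_order = "l"
    # No else branch: for any other input the name is never bound,
    # so the return below raises UnboundLocalError.
    return image_channel_order


def channel_order_fixed(image_type):
    """Same lookup, but unknown inputs fail with a clear message."""
    orders = {"rgb": "bgr", "grayscale": "l"}  # hypothetical values
    if image_type not in orders:
        raise ValueError("unsupported image_type: %r" % (image_type,))
    return orders[image_type]


try:
    channel_order_buggy("RGB")  # unexpected spelling hits the gap
except UnboundLocalError as err:
    print("buggy version:", err)
```

If the real loader follows a similar pattern, the error would be triggered by a spec value the code does not expect, or by a missing one, so it is worth double-checking the image-related fields in the spec file as well.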