Faster RCNN ResNet-101 Problems

Hi, I am retrying nvidia-tlt after more than three months, prompted by the release of DeepStream 5.0 and other improvements, notably the availability of ResNet-101 as a backbone. To recall how everything works, I am going through the ipynb example. The config file is exactly as it is in the Docker container, except for some paths. I have two problems at the moment:

  1. Training worked OK: the losses decreased during the first few epochs, though they started going up again after epoch 4 or so. But evaluation is terrible: I am getting zero mAP, zero precision and zero recall. I ran the visualisation, and it turns out the model is predicting the same box in the bottom-right corner of every image.

Training:

Epoch 1/12
6434/6434 [==============================] - 5866s 912ms/step - loss: 0.5379 - rpn_out_class_loss: 0.1280 - rpn_out_regress_loss: 0.0147 - dense_class_td_loss: 0.1353 - dense_regress_td_loss: 0.0825 - dense_class_td_acc: 0.9660
Epoch 2/12
6434/6434 [==============================] - 5409s 841ms/step - loss: 0.3461 - rpn_out_class_loss: 0.1283 - rpn_out_regress_loss: 0.0129 - dense_class_td_loss: 0.1016 - dense_regress_td_loss: 0.0622 - dense_class_td_acc: 0.9731
Epoch 3/12
6434/6434 [==============================] - 5387s 837ms/step - loss: 0.3563 - rpn_out_class_loss: 0.1272 - rpn_out_regress_loss: 0.0125 - dense_class_td_loss: 0.1111 - dense_regress_td_loss: 0.0685 - dense_class_td_acc: 0.9702
Epoch 4/12
6434/6434 [==============================] - 5383s 837ms/step - loss: 0.3463 - rpn_out_class_loss: 0.1269 - rpn_out_regress_loss: 0.0124 - dense_class_td_loss: 0.1062 - dense_regress_td_loss: 0.0649 - dense_class_td_acc: 0.9714
Epoch 5/12
6434/6434 [==============================] - 5385s 837ms/step - loss: 0.3914 - rpn_out_class_loss: 0.1267 - rpn_out_regress_loss: 0.0122 - dense_class_td_loss: 0.1343 - dense_regress_td_loss: 0.0831 - dense_class_td_acc: 0.9643
Epoch 6/12
6434/6434 [==============================] - 5379s 836ms/step - loss: 0.3680 - rpn_out_class_loss: 0.1267 - rpn_out_regress_loss: 0.0122 - dense_class_td_loss: 0.1209 - dense_regress_td_loss: 0.0735 - dense_class_td_acc: 0.9681
Epoch 7/12
6434/6434 [==============================] - 5371s 835ms/step - loss: 0.3707 - rpn_out_class_loss: 0.1266 - rpn_out_regress_loss: 0.0121 - dense_class_td_loss: 0.1224 - dense_regress_td_loss: 0.0753 - dense_class_td_acc: 0.9672
Epoch 8/12
6434/6434 [==============================] - 5372s 835ms/step - loss: 0.3709 - rpn_out_class_loss: 0.1266 - rpn_out_regress_loss: 0.0121 - dense_class_td_loss: 0.1226 - dense_regress_td_loss: 0.0757 - dense_class_td_acc: 0.9672

Evaluation:

2020-05-22 21:39:41,472 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/test.pyc: 1046/1047
2020-05-22 21:39:41,767 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/test.pyc: Elapsed time = 0.294595956802
================================================================================
Class      AP       precision   recall   RPN_recall
cyclist    0.0000   0.0000      0.0000   0.0425
car        0.0000   0.0000      0.0000   0.1037
person     0.0000   0.0000      0.0000   0.0437

mAP = 0.0000

  2. I went ahead and tried to run the model export section, but even that is not working.

Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-export", line 8, in <module>
    sys.exit(main())
  File "./common/export/app.py", line 221, in main
  File "./common/export/base_exporter.py", line 69, in set_keras_backend_dtype
  File "./common/utils.py", line 189, in get_decoded_filename
IOError: Invalid decryption. Unable to open file (File signature not found). The key used to load the model is incorrect.

Can you please help? Thanks

Are you using the 2.0_dp version of the docker now? If so, please recheck your images/labels, because in the new 2.0_dp docker, faster-rcnn does not support training on images of multiple resolutions or resizing images during training. All of the images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly (see the sketch below).
For your second question, please make sure the API key is correct.
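For reference, the offline resizing could be done with a small script along the lines of the sketch below. This is only a minimal sketch: it assumes Pillow is installed, KITTI-format label files (bbox coordinates in columns 5-8), and a placeholder target size of 1248 x 384; the paths and the target size must be adjusted to whatever your spec file uses.

# Minimal sketch: resize KITTI images offline to the training resolution and
# scale the bounding boxes in the KITTI label files by the same factors.
# Assumptions: Pillow is installed, labels are in KITTI format, and the
# target size 1248 x 384 plus all paths below are placeholders.
import os
from PIL import Image

IMAGE_DIR = "training/image_2"
LABEL_DIR = "training/label_2"
OUT_IMAGE_DIR = "resized/image_2"
OUT_LABEL_DIR = "resized/label_2"
TARGET_W, TARGET_H = 1248, 384

for d in (OUT_IMAGE_DIR, OUT_LABEL_DIR):
    if not os.path.isdir(d):
        os.makedirs(d)

for name in os.listdir(IMAGE_DIR):
    if not name.endswith(".png"):
        continue
    img = Image.open(os.path.join(IMAGE_DIR, name))
    orig_w, orig_h = img.size
    sx = float(TARGET_W) / orig_w
    sy = float(TARGET_H) / orig_h

    # Resize the image to the final training size.
    img.resize((TARGET_W, TARGET_H), Image.BILINEAR).save(
        os.path.join(OUT_IMAGE_DIR, name))

    # Scale the bbox coordinates (left, top, right, bottom) in the label file.
    label_name = name.replace(".png", ".txt")
    scaled = []
    with open(os.path.join(LABEL_DIR, label_name)) as f:
        for line in f:
            parts = line.split()
            for i, s in zip((4, 5, 6, 7), (sx, sy, sx, sy)):
                parts[i] = "{:.2f}".format(float(parts[i]) * s)
            scaled.append(" ".join(parts))
    with open(os.path.join(OUT_LABEL_DIR, label_name), "w") as f:
        f.write("\n".join(scaled) + "\n")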

The images are from the KITTI dataset. As per the Jupyter notebook, I downloaded the zip files, unzipped them, and converted them to TFRecords. When I ran the visualisation later, I noticed that the images are all slightly different in size (e.g. 1224 x 370, 1242 x 375, …). I had assumed they would all be the size specified in the default spec. But don't you think the problem here is during evaluation and inference, and not necessarily during training?
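For anyone who wants to check the same thing, a quick sketch along these lines lists the distinct resolutions in the image folder (assuming Pillow is installed; the image directory below is a placeholder):

# Quick sketch to list the distinct image resolutions in the KITTI image folder.
# Assumes Pillow is installed; "training/image_2" is a placeholder path.
import os
from collections import Counter
from PIL import Image

image_dir = "training/image_2"
sizes = Counter()
for name in os.listdir(image_dir):
    if not name.endswith(".png"):
        continue
    with Image.open(os.path.join(image_dir, name)) as img:
        sizes[img.size] += 1

for (w, h), count in sizes.most_common():
    print("{} x {}: {} images".format(w, h, count))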

For the key, I've double-checked. It's the same key that I used for downloading the data and for training, both of which worked fine. The only thing left to try would be to generate a new key, as this one is from January.

Yes, if you are using the KITTI dataset, you need not resize, because the setting in your training spec matches the average resolution of the KITTI dataset. Unfortunately, I can reproduce the issue you mentioned; I will sync with the internal team about it and will update you if there is any finding.
For the key, you need not generate a new one. Please just confirm that:

  1. the key is correct
  2. $KEY is not empty and expands to the correct value (see the short check below)
  3. you trained this tlt model with the same key.
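As a quick sanity check, something like the sketch below can confirm that the key is actually set before calling tlt-export. This is only a minimal sketch; KEY is assumed to be the environment variable name you exported in the notebook, so adjust it if you named it differently.

# Minimal sanity check that the KEY environment variable used with tlt-export
# is set and non-empty. "KEY" is an assumed variable name from the notebook.
import os

key = os.environ.get("KEY", "")
if not key:
    raise ValueError("KEY is empty or not set")
print("KEY is set ({} characters)".format(len(key)))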

Hi @morganh, the key issue is fine now. I had mistakenly put single quote marks around the key ('$KEY'), which stops the shell from expanding the variable, so the literal string was passed instead of the actual key. Please let me know when you have some news on the evaluation / inference issue. Thanks!

For the mAP issue, it looks like the pretrained weights are not very good. So please do not freeze any CNN blocks in the spec file, i.e. do not specify any freeze_blocks in it, and also do not freeze batch normalization (freeze_bn):
-freeze_bn: True
-freeze_blocks: 0
-freeze_blocks: 1
+freeze_bn: False

Try running training with batch size 1 on a single GPU. Also, you may hit an OOM error, since ResNet-101 is a big backbone that requires more GPU memory; in that case, please try a GPU with more memory.
If you have more GPU memory available, you can also increase the batch size to get better mAP. But fundamentally, ResNet-101 is large and cannot be trained with batch size 16 on a single GPU.