Faster RCNN ResNet-101 Problems

Hi, I am retrying nvidia-tlt after more than three months, prompted by the release of DeepStream 5.0 and other improvements, notably the availability of ResNet-101 as a backbone. To recall how everything works, I am going through the ipynb example. The config file is exactly as it is in the Docker container, except for some paths. I have two problems at the moment:

  1. Training worked ok: losses decreased during the first few epochs, though they started going up after epoch 4 or so. But evaluation is terrible: I am getting zero mAP, zero precision and zero recall. I ran the visualisation, and it turns out the model is predicting the same box at the bottom-right corner for every image.

Training:

Epoch 1/12
6434/6434 [==============================] - 5866s 912ms/step - loss: 0.5379 - rpn_out_class_loss: 0.1280 - rpn_out_regress_loss: 0.0147 - dense_class_td_loss: 0.1353 - dense_regress_td_loss: 0.0825 - dense_class_td_acc: 0.9660
Epoch 2/12
6434/6434 [==============================] - 5409s 841ms/step - loss: 0.3461 - rpn_out_class_loss: 0.1283 - rpn_out_regress_loss: 0.0129 - dense_class_td_loss: 0.1016 - dense_regress_td_loss: 0.0622 - dense_class_td_acc: 0.9731
Epoch 3/12
6434/6434 [==============================] - 5387s 837ms/step - loss: 0.3563 - rpn_out_class_loss: 0.1272 - rpn_out_regress_loss: 0.0125 - dense_class_td_loss: 0.1111 - dense_regress_td_loss: 0.0685 - dense_class_td_acc: 0.9702
Epoch 4/12
6434/6434 [==============================] - 5383s 837ms/step - loss: 0.3463 - rpn_out_class_loss: 0.1269 - rpn_out_regress_loss: 0.0124 - dense_class_td_loss: 0.1062 - dense_regress_td_loss: 0.0649 - dense_class_td_acc: 0.9714
Epoch 5/12
6434/6434 [==============================] - 5385s 837ms/step - loss: 0.3914 - rpn_out_class_loss: 0.1267 - rpn_out_regress_loss: 0.0122 - dense_class_td_loss: 0.1343 - dense_regress_td_loss: 0.0831 - dense_class_td_acc: 0.9643
Epoch 6/12
6434/6434 [==============================] - 5379s 836ms/step - loss: 0.3680 - rpn_out_class_loss: 0.1267 - rpn_out_regress_loss: 0.0122 - dense_class_td_loss: 0.1209 - dense_regress_td_loss: 0.0735 - dense_class_td_acc: 0.9681
Epoch 7/12
6434/6434 [==============================] - 5371s 835ms/step - loss: 0.3707 - rpn_out_class_loss: 0.1266 - rpn_out_regress_loss: 0.0121 - dense_class_td_loss: 0.1224 - dense_regress_td_loss: 0.0753 - dense_class_td_acc: 0.9672
Epoch 8/12
6434/6434 [==============================] - 5372s 835ms/step - loss: 0.3709 - rpn_out_class_loss: 0.1266 - rpn_out_regress_loss: 0.0121 - dense_class_td_loss: 0.1226 - dense_regress_td_loss: 0.0757 - dense_class_td_acc: 0.9672

Evaluation:

2020-05-22 21:39:41,472 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/test.pyc: 1046/1047
2020-05-22 21:39:41,767 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/test.pyc: Elapsed time = 0.294595956802
================================================================================
Class               AP                  precision           recall              RPN_recall          
--------------------------------------------------------------------------------
cyclist             0.0000              0.0000              0.0000              0.0425              
--------------------------------------------------------------------------------
car                 0.0000              0.0000              0.0000              0.1037              
--------------------------------------------------------------------------------
person              0.0000              0.0000              0.0000              0.0437              
--------------------------------------------------------------------------------
mAP = 0.0000

  2. I went ahead and tried to execute the model export sections. But even this is not working.

Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/tlt-export", line 8, in
sys.exit(main())
File "./common/export/app.py", line 221, in main
File "./common/export/base_exporter.py", line 69, in set_keras_backend_dtype
File "./common/utils.py", line 189, in get_decoded_filename
IOError: Invalid decryption. Unable to open file (File signature not found). The key used to load the model is incorrect.

Can you please help? Thanks

Are you using the 2.0_dp version of the docker now? If yes, please recheck your images/labels, because in the new 2.0_dp docker, faster-rcnn does not support training on images of multiple resolutions, or resizing images during training. So all of the images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly; a sketch of such offline resizing follows below.
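
For reference, here is a minimal offline-resize sketch in Python (my own illustration, not an NVIDIA tool), assuming KITTI-style image_2/label_2 folders and the 1248 x 384 training size from the spec; the paths are hypothetical:

import os
from PIL import Image

SRC_IMG, SRC_LBL = "training/image_2", "training/label_2"  # hypothetical paths
DST_IMG, DST_LBL = "resized/image_2", "resized/label_2"
TARGET_W, TARGET_H = 1248, 384  # final training size from the spec file

os.makedirs(DST_IMG, exist_ok=True)
os.makedirs(DST_LBL, exist_ok=True)

for name in os.listdir(SRC_IMG):
    stem = os.path.splitext(name)[0]
    img = Image.open(os.path.join(SRC_IMG, name))
    sx, sy = TARGET_W / img.width, TARGET_H / img.height
    img.resize((TARGET_W, TARGET_H), Image.BILINEAR).save(os.path.join(DST_IMG, name))

    # KITTI label columns 4-7 are the bbox corners (x1, y1, x2, y2);
    # scale them by the same factors as the image.
    out = []
    with open(os.path.join(SRC_LBL, stem + ".txt")) as f:
        for line in f:
            cols = line.split()
            x1, y1, x2, y2 = (float(c) for c in cols[4:8])
            cols[4:8] = ["%.2f" % (x1 * sx), "%.2f" % (y1 * sy),
                         "%.2f" % (x2 * sx), "%.2f" % (y2 * sy)]
            out.append(" ".join(cols))
    with open(os.path.join(DST_LBL, stem + ".txt"), "w") as f:
        f.write("\n".join(out) + "\n")
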
For your 2nd question, please make sure the API key is correct.

The images are from the KITTI dataset. As per the Jupyter notebook, I downloaded the zip folders, unzipped them, and converted them to TFRecords. When I ran the visualisation later, I noticed that the images were all slightly different in size (e.g. 1224 x 370, 1242 x 375, …). I'd assumed they would all be the size specified in the default specs. But don't you think the problem here is during evaluation and inference, and not necessarily during training?

For the key, I've double-checked. It's the same key that I used for downloading the data and for training, both of which worked fine. The only thing left to try is generating a new key, as this one is from January.

Yes, if you are using the KITTI dataset, you need not resize, because the setting in your training spec matches the average resolution of the KITTI dataset. Unfortunately, I can reproduce the issue you mentioned; I will sync with the internal team about it and update you if there is any finding.
For the key, you need not generate a new one. Please just confirm that:

  1. the key is correct
  2. $KEY is not empty and expands to the correct value
  3. you trained this .tlt model with the same key.

Hi @morganh, the key issue seems to be fine now. I think my mistake was putting single quotes around the key ('$KEY'): in bash, single quotes prevent variable expansion, so the literal string $KEY was passed instead of the key's value. Please let me know when you have some news on the evaluation / inference. Thanks!

For the mAP issue, it looks like the pretrained weights are not so good. So please do not freeze any CNN blocks in the spec file, i.e., do not specify any freeze_blocks in it, and do not freeze_bn:
-freeze_bn: True
-freeze_blocks: 0
-freeze_blocks: 1
+freeze_bn: False

Try running training with batch size 1 on a single GPU. Also, you may hit an OOM error, since ResNet-101 is a big backbone that requires more GPU memory; in that case, please try a GPU with more memory.
If you have more GPU memory, you can also increase the batch size to get better mAP. But basically, ResNet-101 is big and cannot use batch size 16 on a single GPU.

Hi, I made the changes you mentioned for BatchNorm and trained with batch size = 1. Training went better this time, as the losses kept going down for all 12 epochs, but evaluation is still poor, especially for the non-car classes:

================================================================================
Class               AP                  precision           recall              RPN_recall          
--------------------------------------------------------------------------------
cyclist             0.0000              0.0000              0.0000              0.3538              
--------------------------------------------------------------------------------
car                 0.3041              0.9744              0.3058              0.6382              
--------------------------------------------------------------------------------
person              0.0000              0.0000              0.0000              0.3879              
--------------------------------------------------------------------------------
mAP = 0.1014 

I could train for more epochs, but in January I got better results with ResNet-50 in fewer epochs. I think there's still something wrong with ResNet-101 or this release of tlt.

January results with ResNet-50:

================================================================================
Class               AP                  precision           recall              
--------------------------------------------------------------------------------
Cyclist             0.5365              0.4578              0.6023              
--------------------------------------------------------------------------------
Pedestrian          0.5150              0.6083              0.5689              
--------------------------------------------------------------------------------
Car                 0.7911              0.7807              0.8109              
--------------------------------------------------------------------------------
mAP = 0.6142

For reference, here are the training logs (from today):

==================================================================================================
Total params: 79,869,949
Trainable params: 79,707,261
Non-trainable params: 162,688
__________________________________________________________________________________________________
Epoch 1/12
6434/6434 [==============================] - 6167s 959ms/step - loss: 0.7579 - rpn_out_class_loss: 0.0416 - rpn_out_regress_loss: 0.0153 - dense_class_td_loss: 0.1394 - dense_regress_td_loss: 0.1394 - dense_class_td_acc: 0.9499
Epoch 2/12
6434/6434 [==============================] - 5658s 879ms/step - loss: 0.4526 - rpn_out_class_loss: 0.0278 - rpn_out_regress_loss: 0.0084 - dense_class_td_loss: 0.1112 - dense_regress_td_loss: 0.1220 - dense_class_td_acc: 0.9595
Epoch 3/12
6434/6434 [==============================] - 5619s 873ms/step - loss: 0.3876 - rpn_out_class_loss: 0.0244 - rpn_out_regress_loss: 0.0072 - dense_class_td_loss: 0.0999 - dense_regress_td_loss: 0.1124 - dense_class_td_acc: 0.9631
Epoch 4/12
6434/6434 [==============================] - 5580s 867ms/step - loss: 0.3577 - rpn_out_class_loss: 0.0224 - rpn_out_regress_loss: 0.0066 - dense_class_td_loss: 0.0952 - dense_regress_td_loss: 0.1079 - dense_class_td_acc: 0.9649
Epoch 5/12
6434/6434 [==============================] - 5548s 862ms/step - loss: 0.3255 - rpn_out_class_loss: 0.0198 - rpn_out_regress_loss: 0.0062 - dense_class_td_loss: 0.0881 - dense_regress_td_loss: 0.1018 - dense_class_td_acc: 0.9672
Epoch 6/12
6434/6434 [==============================] - 5529s 859ms/step - loss: 0.3097 - rpn_out_class_loss: 0.0190 - rpn_out_regress_loss: 0.0059 - dense_class_td_loss: 0.0842 - dense_regress_td_loss: 0.0983 - dense_class_td_acc: 0.9685
Epoch 7/12
6434/6434 [==============================] - 5502s 855ms/step - loss: 0.2943 - rpn_out_class_loss: 0.0180 - rpn_out_regress_loss: 0.0057 - dense_class_td_loss: 0.0805 - dense_regress_td_loss: 0.0944 - dense_class_td_acc: 0.9699
Epoch 8/12
6434/6434 [==============================] - 5496s 854ms/step - loss: 0.2834 - rpn_out_class_loss: 0.0172 - rpn_out_regress_loss: 0.0056 - dense_class_td_loss: 0.0788 - dense_regress_td_loss: 0.0925 - dense_class_td_acc: 0.9704
Epoch 9/12
6434/6434 [==============================] - 5489s 853ms/step - loss: 0.2748 - rpn_out_class_loss: 0.0165 - rpn_out_regress_loss: 0.0055 - dense_class_td_loss: 0.0769 - dense_regress_td_loss: 0.0904 - dense_class_td_acc: 0.9713
Epoch 10/12
6434/6434 [==============================] - 5478s 851ms/step - loss: 0.2733 - rpn_out_class_loss: 0.0166 - rpn_out_regress_loss: 0.0054 - dense_class_td_loss: 0.0756 - dense_regress_td_loss: 0.0897 - dense_class_td_acc: 0.9717
Epoch 11/12
6434/6434 [==============================] - 5482s 852ms/step - loss: 0.2600 - rpn_out_class_loss: 0.0153 - rpn_out_regress_loss: 0.0052 - dense_class_td_loss: 0.0732 - dense_regress_td_loss: 0.0869 - dense_class_td_acc: 0.9723
Epoch 12/12
6434/6434 [==============================] - 5490s 853ms/step - loss: 0.2587 - rpn_out_class_loss: 0.0154 - rpn_out_regress_loss: 0.0052 - dense_class_td_loss: 0.0725 - dense_regress_td_loss: 0.0861 - dense_class_td_acc: 0.9727

Thanks for the details. We are still checking the mAP too. Several comments here.

  1. ResNet-101 is a big network. Training a big backbone (like ResNet-101) on a small dataset (like KITTI) does not seem to work well.
  2. We find that an intermediate model may have a better validation mAP. The next release (2.0 GA) of faster-rcnn will implement validation during training, which makes it convenient to check the mAP periodically.
  3. For the ResNet-50 you mentioned, could you please compare the mAP result in the 2.0_dp docker against the 1.0.1 docker?

Hi, I trained with ResNet-50 last night and evaluated just now with object_confidence_thres: 0.50, getting these results:

================================================================================
Class               AP                  precision           recall              RPN_recall          
--------------------------------------------------------------------------------
cyclist             0.6452              0.4140              0.7264              0.9151              
--------------------------------------------------------------------------------
car                 0.8536              0.8128              0.8679              0.9846              
--------------------------------------------------------------------------------
person              0.6000              0.5253              0.6587              0.9013              
--------------------------------------------------------------------------------
mAP = 0.6996 

Note that I had

freeze_bn: True
freeze_blocks: 0
freeze_blocks: 1

during training.
So it looks like the problem is only with ResNet-101.
I doubt the size of the dataset is the cause: I've trained Faster-RCNN with R101 in other frameworks (TensorFlow & PyTorch) on quite small datasets and had good results.

We will dig into ResNet-101 further.
One question: which pretrained weights did you use to train the ResNet-101 FasterRCNN in TensorFlow?

Hi chandrachud,

Can you share the code base for your TensorFlow ResNet-101 FasterRCNN training? Basically, I would like to know the batch size you used in that training. ResNet-101 is a huge backbone and cannot fit on a single GPU with a large batch size like 16, so the BatchNorm moving mean and moving variance are not good in that case.

BTW, what batch size did you use when you trained ResNet-50 in TLT?

Thanks
Zhimeng

Sorry, I was mistaken. I have only used ResNet-18 and ResNet-50 in TensorFlow. In PyTorch I have used ResNet-101 (available from Torchvision), but I guess converting those pretrained weights to something compatible would be complicated.

For both ResNet-101 and ResNet-50, I used the default batch size of 1 from the config file. I didn't change anything except the paths to the images.

As I said in my reply to Morganh, I haven't actually used ResNet-101 in TensorFlow. Apologies for the error.

Pretrained weights seem to be available for tensorflow.keras. Will these work with nvidia-tlt? Also, I'm not sure what the relation is between the batch size used for pre-training on ImageNet and our training as part of Faster-RCNN. A small batch size for Faster-RCNN may be acceptable, even if it's slower than ideal.

Hi cbasavaraj,

Arbitrary pretrained weights found on the Internet cannot be loaded into a TLT FasterRCNN training, since the weights are loaded by name and depend on the implementation. The training batch size of TLT FasterRCNN is not related to the ImageNet pre-training.
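
For illustration, name-based loading in Keras works roughly like this (a generic sketch of the mechanism, not TLT's internal loader):

from tensorflow import keras

# Two models that share one layer name ("shared_dense"). Weights saved from
# the first can be loaded into the second by name; everything else is skipped.
src = keras.Sequential([
    keras.layers.Dense(8, input_shape=(4,), name="shared_dense"),
    keras.layers.Dense(2, name="src_head"),
])
src.save_weights("src_weights.h5")

dst = keras.Sequential([
    keras.layers.Dense(8, input_shape=(4,), name="shared_dense"),
    keras.layers.Dense(3, name="dst_head"),  # name not present in the file
])
# by_name=True matches layers by name: "shared_dense" receives the saved
# weights, while "dst_head" keeps its freshly initialized values.
dst.load_weights("src_weights.h5", by_name=True)

So unless the layer names (and shapes) of an external ResNet-101 checkpoint line up with TLT's model definition, the load will fail or silently skip layers.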

Will update you more later.

Thanks
Zhimeng

@cbasavaraj
The NV internal team changed the optimizer to SGD and fine-tuned the learning-rate scheduler; the mAP can reach 49% now. Please try it on your side too. Thanks.

Attaching the training spec for your reference.

##Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
random_seed: 42
enc_key: 'tlt'
verbose: True
network_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
      height: 384
      width: 1248
    }
    image_channel_mean {
      key: 'b'
      value: 103.939
    }
    image_channel_mean {
      key: 'g'
      value: 116.779
    }
    image_channel_mean {
      key: 'r'
      value: 123.68
    }
    image_scaling_factor: 1.0
    max_objects_num_per_image: 100
  }
  feature_extractor: "resnet:101"
  anchor_box_config {
    scale: 64.0
    scale: 128.0
    scale: 256.0
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
  }
  freeze_bn: False
  roi_mini_batch: 256
  rpn_stride: 16
  conv_bn_share_bias: True
  roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
  }
  all_projections: True
  use_pooling: False
}
training_config {
  kitti_data_config {
    data_sources: {
      tfrecords_path: "/home/zhimengf/zhimengf_ws/pascal_voc/kitti_random_split/training/tfrecords/kitti_trainval/kitti_trainval*"
      image_directory_path: "/home/zhimengf/zhimengf_ws/pascal_voc/kitti_random_split/training"
    }
    image_extension: 'png'
    target_class_mapping {
      key: 'car'
      value: 'car'
    }
    target_class_mapping {
      key: 'van'
      value: 'car'
    }
    target_class_mapping {
      key: 'pedestrian'
      value: 'person'
    }
    target_class_mapping {
      key: 'person_sitting'
      value: 'person'
    }
    target_class_mapping {
      key: 'cyclist'
      value: 'cyclist'
    }
    validation_fold: 0
  }
  data_augmentation {
    preprocessing {
      output_image_width: 1248
      output_image_height: 384
      output_image_channel: 3
      min_bbox_width: 1.0
      min_bbox_height: 1.0
    }
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.0
      zoom_min: 1.0
      zoom_max: 1.0
      translate_max_x: 0
      translate_max_y: 0
    }
    color_augmentation {
      hue_rotation_max: 0.0
      saturation_shift_max: 0.0
      contrast_scale_max: 0.0
      contrast_center: 0.5
    }
  }
  enable_augmentation: True
  batch_size_per_gpu: 1
  num_epochs: 12
  pretrained_weights: "/home/zhimengf/zhimengf_ws/pascal_voc/resnet_101.hdf5"
  output_model: "/home/zhimengf/zhimengf_ws/frcnn_2/train_135/frcnn_kitti_resnet101.tlt"
  rpn_min_overlap: 0.3
  rpn_max_overlap: 0.7
  classifier_min_overlap: 0.0
  classifier_max_overlap: 0.5
  gt_as_roi: False
  std_scaling: 1.0
  classifier_regr_std {
    key: 'x'
    value: 10.0
  }
  classifier_regr_std {
    key: 'y'
    value: 10.0
  }
  classifier_regr_std {
    key: 'w'
    value: 5.0
  }
  classifier_regr_std {
    key: 'h'
    value: 5.0
  }

  rpn_mini_batch: 256
  rpn_pre_nms_top_N: 12000
  rpn_nms_max_boxes: 2000
  rpn_nms_overlap_threshold: 0.7

  reg_config {
    reg_type: 'L2'
    weight_decay: 1e-4
  }

  optimizer {
    sgd {
      lr: 0.02
      momentum: 0.9
      decay: 0.0
      nesterov: False
    }
  }

  lr_scheduler {
    soft_start {
      base_lr: 0.02
      start_lr: 0.002
      soft_start: 0.1
      annealing_points: 0.8
      annealing_points: 0.9
      annealing_divider: 10.0
    }
  }

  lambda_rpn_regr: 1.0
  lambda_rpn_class: 1.0
  lambda_cls_regr: 1.0
  lambda_cls_class: 1.0

  inference_config {
    images_dir: '/home/zhimengf/zhimengf_ws/pascal_voc/kitti_random_split/testing/image_2'
    model: '/home/zhimengf/zhimengf_ws/frcnn_2/train_135/frcnn_kitti_resnet101.epoch12.tlt'
    detection_image_output_dir: '/home/zhimengf/zhimengf_ws/frcnn_2/train_135/inference_results_imgs'
    labels_dump_dir: '/home/zhimengf/zhimengf_ws/frcnn_2/train_135/inference_dump_labels'
    rpn_pre_nms_top_N: 6000
    rpn_nms_max_boxes: 300
    rpn_nms_overlap_threshold: 0.7
    bbox_visualize_threshold: 0.6
    classifier_nms_max_boxes: 300
    classifier_nms_overlap_threshold: 0.3
  }

  evaluation_config {
    model: '/home/zhimengf/zhimengf_ws/frcnn_2/train_135/frcnn_kitti_resnet101.epoch12.tlt'
    labels_dump_dir: '/home/zhimengf/zhimengf_ws/frcnn_2/train_135/test_dump_labels'
    rpn_pre_nms_top_N: 6000
    rpn_nms_max_boxes: 300
    rpn_nms_overlap_threshold: 0.7
    classifier_nms_max_boxes: 300
    classifier_nms_overlap_threshold: 0.3
    object_confidence_thres: 0.0001
    use_voc07_11point_metric: False
  }
}
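
For reference, my reading of the soft_start scheduler in this spec: the learning rate ramps from start_lr up to base_lr over the first soft_start fraction of training, then is divided by annealing_divider at each annealing point. A rough Python sketch of that behaviour (an assumption about the semantics, not NVIDIA's implementation):

def soft_start_lr(progress, base_lr=0.02, start_lr=0.002, soft_start=0.1,
                  annealing_points=(0.8, 0.9), annealing_divider=10.0):
    """Approximate learning rate at `progress`, the fraction of training
    completed, in [0, 1]."""
    if progress < soft_start:
        # warm up from start_lr to base_lr (a linear ramp is assumed here;
        # the exact ramp shape in TLT may differ)
        return start_lr + (base_lr - start_lr) * (progress / soft_start)
    lr = base_lr
    for point in annealing_points:
        if progress >= point:
            lr /= annealing_divider  # step down at each annealing point
    return lr

# e.g. at 5%, 50%, 85% and 95% of training:
print([soft_start_lr(p) for p in (0.05, 0.5, 0.85, 0.95)])
# roughly [0.011, 0.02, 0.002, 0.0002]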

Thanks, I'll try tonight. Could the mAP go higher if you increased object_confidence_thres in your config? I already had mAP = 0.6996 with ResNet-50 and a threshold of 0.50.

That's possible when changing the threshold during evaluation.

I lowered the threshold to 0.01 for ResNet-50 and got the following results:

================================================================================
Class               AP                  precision           recall              RPN_recall          
--------------------------------------------------------------------------------
cyclist             0.6507              0.3618              0.7406              0.9151              
--------------------------------------------------------------------------------
car                 0.8540              0.8095              0.8683              0.9846              
--------------------------------------------------------------------------------
person              0.6051              0.4963              0.6685              0.9013              
--------------------------------------------------------------------------------
mAP = 0.7032 

Slightly higher recalls and lower precisions, giving almost the same mAP as with threshold = 0.50.
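
That makes sense: AP integrates precision over the full recall range, so lowering the confidence threshold mainly appends low-precision points at the tail of the precision-recall curve and moves the area only a little. A toy all-point AP computation for illustration (my sketch, not TLT's evaluator; the second PR point below is hypothetical):

import numpy as np

def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve (all-point AP).
    `recalls` must be sorted in ascending order, one point per threshold."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # interpolate: precision at recall x is the max precision at recall >= x
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]  # indices where recall increases
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# car class at threshold 0.50 (precision/recall from the table above):
print(average_precision([0.3058], [0.9744]))             # ~0.298
# adding a hypothetical low-precision tail point barely moves it:
print(average_precision([0.3058, 0.32], [0.9744, 0.5]))  # ~0.305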

So I doubt that for ResNet-101 the threshold will make a big difference in the mAP. I think I'll take a break and come back to this later.

Thanks for all your efforts.

Hello,
I noticed that DeepStream 5.0.1 was released a couple of weeks ago, and TLT has also been updated. Does this mean that Faster RCNN with ResNet-101 now trains well and gives good average precision? Thanks

As of today, the latest tlt docker is nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3.
See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC