Hi, I am retrying nvidia-tlt after more than three months, prompted by the release of DeepStream 5.0 and other improvements, notably the availability of ResNet-101 as a backbone. To recall how everything works, I am going through the ipynb example. The config file is exactly as it is in the Docker container, except for some paths. I have two problems at the moment:
Training worked OK, as losses decreased during the first few epochs, though they started going up after epoch 4 or so. But evaluation is terrible: I am getting zero mAP, zero precision and zero recall. I ran the visualisation, and it turns out the model is predicting the same box in the bottom-right corner of every image.
2020-05-22 21:39:41,472 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/test.pyc: 1046/1047
2020-05-22 21:39:41,767 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/test.pyc: Elapsed time = 0.294595956802
================================================================================
Class AP precision recall RPN_recall
cyclist 0.0000 0.0000 0.0000 0.0425
car 0.0000 0.0000 0.0000 0.1037
person 0.0000 0.0000 0.0000 0.0437
mAP = 0.0000
I went ahead and tried to run the model export section, but even this is not working.
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/tlt-export", line 8, in <module>
sys.exit(main())
File "./common/export/app.py", line 221, in main
File "./common/export/base_exporter.py", line 69, in set_keras_backend_dtype
File "./common/utils.py", line 189, in get_decoded_filename
IOError: Invalid decryption. Unable to open file (File signature not found). The key used to load the model is incorrect.
Are you using the 2.0_dp version of the docker now? If yes, please recheck your images/labels. In the new 2.0_dp docker, faster-rcnn does not support training on images of multiple resolutions, nor resizing images during training. So all of the images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly.
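As a rough illustration only (not from the TLT notebook), the offline resizing plus label scaling can look like the sketch below; the target size and folder paths are placeholders that you should replace with the resolution from your training spec and your own dataset layout.

# Minimal offline-resize sketch (illustrative, not TLT code): resize every image
# to the training resolution and scale the KITTI bounding boxes to match.
# TARGET_W / TARGET_H and the folder paths are placeholders.
import glob, os
from PIL import Image

TARGET_W, TARGET_H = 1248, 384          # example values; use the size from your spec
IMG_DIR, LBL_DIR = "training/image_2", "training/label_2"
OUT_IMG, OUT_LBL = "resized/image_2", "resized/label_2"
os.makedirs(OUT_IMG, exist_ok=True)
os.makedirs(OUT_LBL, exist_ok=True)

for img_path in glob.glob(os.path.join(IMG_DIR, "*.png")):
    name = os.path.splitext(os.path.basename(img_path))[0]
    img = Image.open(img_path)
    sx, sy = TARGET_W / img.width, TARGET_H / img.height
    img.resize((TARGET_W, TARGET_H), Image.BILINEAR).save(os.path.join(OUT_IMG, name + ".png"))

    with open(os.path.join(LBL_DIR, name + ".txt")) as f:
        rows = [line.split() for line in f.read().splitlines() if line.strip()]
    for row in rows:
        # KITTI label fields 4..7 are xmin, ymin, xmax, ymax in pixels
        row[4], row[6] = str(float(row[4]) * sx), str(float(row[6]) * sx)
        row[5], row[7] = str(float(row[5]) * sy), str(float(row[7]) * sy)
    with open(os.path.join(OUT_LBL, name + ".txt"), "w") as f:
        f.write("\n".join(" ".join(row) for row in rows))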
For your 2nd question, please make sure the API key is correct.
The images are from the KITTI dataset. As per the Jupyter notebook, I downloaded the zip archives, unzipped them, and converted them to TFRecords. When I ran the visualisation later, I noticed that the images were all slightly different in size (e.g. 1224 x 370, 1242 x 375, …). I’d assumed they would all be the size specified in the default specs. But don’t you think the problem here is during evaluation and inference, and not necessarily during training?
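For anyone wanting to check their own data, a quick way to see the spread of resolutions is something like this (the image folder path is a placeholder for wherever the KITTI images live):

# Count the distinct image resolutions in the dataset (path is a placeholder).
import glob, collections
from PIL import Image

sizes = collections.Counter(Image.open(p).size for p in glob.glob("training/image_2/*.png"))
print(sizes)  # e.g. Counter({(1242, 375): ..., (1224, 370): ..., ...})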
For the key, I’ve double-checked. It’s the same key that I used for downloading the data and for training, both of which worked fine. The only thing I could try is generating a new key, as this one is from January.
Yes, if you are using the KITTI dataset, you need not resize, because the setting in your training spec matches the average resolution of the KITTI dataset. Unfortunately, I can reproduce the issue you mentioned; I will sync with the internal team about it and will update you if there is any finding.
For the key, you do not need to generate a new one. Please just confirm that:
the key is correct
$KEY is not empty and $KEY is correct
you were training this tlt model with the same key.
Hi @morganh, the key issue seems to be fine now. I think my mistake was putting single quote marks around the key ('$KEY'), so the variable was never expanded. Please let me know when you have some news on the evaluation / inference. Thanks!
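In case it helps anyone else, here is a tiny illustration of the quoting pitfall (the key value below is made up): the shell only expands $KEY outside single quotes, so a single-quoted key reaches tlt-export as the literal string $KEY and decryption fails.

# Illustration of the shell-quoting pitfall; the key value is made up.
import os, subprocess

os.environ["KEY"] = "my-ngc-api-key"
print(subprocess.run('echo -k "$KEY"', shell=True, capture_output=True, text=True).stdout)
# -> -k my-ngc-api-key   (expanded: what tlt-export needs)
print(subprocess.run("echo -k '$KEY'", shell=True, capture_output=True, text=True).stdout)
# -> -k $KEY             (literal string: decryption fails)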
For the mAP issue, it looks like the pretrained weights are not so good. So please do not freeze any CNN blocks in the spec file, i.e., do not specify any freeze_blocks in it, and do not set freeze_bn either. In other words, change the spec like this:
-freeze_bn: True
-freeze_blocks: 0
-freeze_blocks: 1
+freeze_bn: False
Try to run training with batch size 1 on a single GPU. You may also hit an OOM error, since ResNet101 is a big backbone that requires more GPU memory; in that case, please try another GPU.
If you have more GPU memory, you can also increase the batch size to get a better mAP. But basically, ResNet101 is big and cannot use a batch size of 16 on a single GPU.
Hi, I made the changes you mentioned for BatchNorm and trained with batch size = 1. Training went better this time, as losses kept going down for all 12 epochs, but evaluation is still poor, especially for the non-car classes:
================================================================================
Class AP precision recall RPN_recall
--------------------------------------------------------------------------------
cyclist 0.0000 0.0000 0.0000 0.3538
--------------------------------------------------------------------------------
car 0.3041 0.9744 0.3058 0.6382
--------------------------------------------------------------------------------
person 0.0000 0.0000 0.0000 0.3879
--------------------------------------------------------------------------------
mAP = 0.1014
I could train for more epochs, but in January I had better results with ResNet-50 with fewer epochs. I think there’s still something wrong with ResNet-101 or with this release of TLT.
January results with ResNet-50:
================================================================================
Class AP precision recall
--------------------------------------------------------------------------------
Cyclist 0.5365 0.4578 0.6023
--------------------------------------------------------------------------------
Pedestrian 0.5150 0.6083 0.5689
--------------------------------------------------------------------------------
Car 0.7911 0.7807 0.8109
--------------------------------------------------------------------------------
mAP = 0.6142
For reference, here are the training logs (from today):
Thanks for the details. We are still checking the mAP too. Several comments here.
ResNet101 is a big network. Training a big backbone (like ResNet101) on a small dataset (like KITTI) does not seem to work well.
We find that an intermediate model may have a better validation mAP. The next release (2.0 GA) of faster-rcnn will implement validation during training, which makes it convenient to check the mAP periodically.
For the ResNet50 you mentioned, could you please compare the mAP result in the 2.0_dp docker against the 1.0.1 docker?
Hi, I trained with ResNet50 last night and evaluated just now, with object_confidence_thres: 0.50, and got these results:
================================================================================
Class AP precision recall RPN_recall
--------------------------------------------------------------------------------
cyclist 0.6452 0.4140 0.7264 0.9151
--------------------------------------------------------------------------------
car 0.8536 0.8128 0.8679 0.9846
--------------------------------------------------------------------------------
person 0.6000 0.5253 0.6587 0.9013
--------------------------------------------------------------------------------
mAP = 0.6996
Note that I had
freeze_bn: True
freeze_blocks: 0
freeze_blocks: 1
during training.
So it looks like the problem is only with ResNet-101.
I doubt it’s the size of the dataset that’s causing the problem. I’ve trained Faster-RCNN with R101 in other frameworks (TensorFlow & PyTorch) on quite small datasets and had good results.
Can you share the code base for your TensorFlow ResNet101 FasterRCNN training? Basically, I would like to know the batch size you used in that training. ResNet101 is a huge backbone and cannot fit on a single GPU with a large batch size like 16, so the moving mean and moving variance of BatchNorm are not good in this case.
BTW, what batch size did you use when you trained ResNet50 in TLT?
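To illustrate the moving mean / variance point with a toy example (synthetic numbers, not TLT code): the per-batch statistics that BatchNorm relies on become much noisier as the batch size shrinks, which is what freezing BN side-steps.

# Toy illustration: the per-batch mean that BatchNorm would use gets much
# noisier as the batch size shrinks (synthetic activations, not TLT code).
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(loc=2.0, scale=3.0, size=(8192, 64)).astype("float32")  # fake activations

for bs in (1, 4, 16, 64):
    batch_means = [acts[i:i + bs].mean() for i in range(0, len(acts), bs)]
    print("batch size %3d -> std of per-batch means: %.3f" % (bs, float(np.std(batch_means))))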
Sorry, I was mistaken: I have only used ResNet-18 and ResNet-50 in TensorFlow. In PyTorch I’ve used ResNet-101 (available from Torchvision), but I guess converting those pretrained weights to something compatible would be complicated.
For both ResNet101 and ResNet50, I used the default batch size of 1 from the config file. I didn’t change anything except the paths to the images.
Like I said in my reply to Morgan, I haven’t actually used ResNet101 in TensorFlow. Apologies for the error.
Pretrained weights seem to be available for tensorflow.keras. Will these work with nvidia-tlt? I’m also not sure what the relation is between the batch size used for pre-training on ImageNet and our training as part of faster-rcnn. A small batch size for faster-rcnn may be acceptable, even if it’s slower than ideal.
Arbitrary pretrained weights found on the Internet cannot be loaded into a TLT FasterRCNN training, since the weights are loaded by name and depend on the implementation. The training batch size of TLT FasterRCNN is not related to the ImageNet training.
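To be clear, "loaded by name" refers to the standard Keras weight-loading behaviour, roughly as below (generic Keras usage, not TLT internals; the checkpoint file name is a placeholder):

# Generic Keras behaviour (not TLT internals): with by_name=True, weights are
# matched layer-by-layer by layer name, so an arbitrary checkpoint only fills
# the layers whose names line up with what the target model expects.
from tensorflow.keras.applications import ResNet101

model = ResNet101(weights=None)
model.load_weights("some_resnet101_checkpoint.h5", by_name=True)  # placeholder file name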
@cbasavaraj
The NV internal team changed the optimizer to SGD and fine-tuned the learning rate scheduler; the mAP can reach 49% now. Please try it on your side too. Thanks.
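The exact spec values are not shown here, but in plain Keras terms "SGD with a fine-tuned learning rate scheduler" means something roughly like the following (illustrative numbers only):

# Illustrative only, not the actual TLT spec values: SGD with momentum plus a
# stepped (piecewise-constant) learning-rate schedule.
import tensorflow as tf

boundaries = [20000, 40000]      # iteration counts, made up
values = [1e-2, 1e-3, 1e-4]      # learning rate in each interval, made up
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)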
Thanks, I’ll try tonight. Can the mAP go higher if you increase the object_confidence_thres in your config? I already had mAP = 0.6996 with ResNet50 and a threshold of 0.50.
Hello,
I noticed that DeepStream 5.0.1 was released a couple of weeks ago, and TLT has also been updated. Does this mean that Faster RCNN with ResNet-101 now trains well and gives good average precision? Thanks