Training doesn't converge for Mapillary Vistas Dataset training with MaskRCNN

Please use one forum user’s dataset (TLT train maskrcnn model with Mapillary Vistas Dataset failed on CUDA_ERROR_OUT_OF_MEMORY: out of memory - #50 by sorata118) or generate tfrecords via the method mentioned in TLT train maskrcnn model with Mapillary Vistas Dataset failed on CUDA_ERROR_OUT_OF_MEMORY: out of memory - #41 by Morganh and then retry.

Mine is not a memory issue. So if I set image_size: "(128, 128)" # "(832, 1344)" to a smaller size, do I still need to resize the images and json files?

I found this paper.
Yes, the number of classes is 37, not 147.
Then, to increase accuracy, I need to filter out small-object classes by setting a threshold.
Their loss is also around 2.

I have a few queries
(1) Is it possible to train only the last few layers of ResNet50 in TAO for MaskRCNN? I just want to fine-tune the MaskRCNN heads; how can I do that?
(2) Which software can I use to view the dataset's annotations (json) and images?
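For a quick look without dedicated tooling, a stdlib-only Python summary works on any json that follows the standard COCO layout (images/annotations/categories). A minimal sketch with inline stand-in data (a real run would `json.load` the annotation file instead):

```python
import json

# Minimal COCO-style structure standing in for a real annotation file
# such as instances_shape_validation2020.json (hypothetical content).
doc = {
    "images": [{"id": 1, "file_name": "0001.jpg"}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 2, "bbox": [5, 5, 40, 30]}],
    "categories": [{"id": 2, "name": "car"}],
}

# Map category ids to names, then list each annotation with its label.
cats = {c["id"]: c["name"] for c in doc["categories"]}
for ann in doc["annotations"]:
    print(ann["image_id"], cats[ann["category_id"]], ann["bbox"])  # -> 1 car [5, 5, 40, 30]
```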

Refer to “freeze_blocks” in MaskRCNN - NVIDIA Docs

What do you mean by viewing the json file?

Below is an example. When training on the COCO dataset, the spec is inside this topic.

(1) How many blocks should I freeze via freeze_blocks if I want to train only the MaskRCNN heads with ResNet50?
My accuracy is still not good.

AP: 0.149292663
AP50: 0.231117696
AP75: 0.147759020
APl: 0.170746893
APm: 0.034814980
APs: 0.002098019
ARl: 0.381483465
ARm: 0.067172319
ARmax1: 0.236285821
ARmax10: 0.323538810
ARmax100: 0.327527165
ARs: 0.009126985
mask_AP: 0.111917272
mask_AP50: 0.183239624
mask_AP75: 0.114607982
mask_APl: 0.127348930
mask_APm: 0.007120144
mask_APs: 0.000000000
mask_ARl: 0.247205675
mask_ARm: 0.028806869
mask_ARmax1: 0.168297976
mask_ARmax10: 0.207456693
mask_ARmax100: 0.209397107
mask_ARs: 0.000000000

(2) My number of images is 18,000.
So the number of steps in the config file should be
total_steps = total_images * total_epochs / batch_size / nGPUs
total_steps = 18000 * 20 / 4 / 1 = 90,000
Is that correct?
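The arithmetic above can be sketched as a small helper, just to make the unit bookkeeping explicit (each step consumes batch_size images per GPU):

```python
def total_steps(total_images, total_epochs, batch_size, n_gpus):
    # Images processed per optimizer step across all GPUs is
    # batch_size * n_gpus, so divide the total image count by that.
    return total_images * total_epochs // (batch_size * n_gpus)

print(total_steps(18000, 20, 4, 1))  # -> 90000
```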

(3) When I train on 4 GPUs with the following command

!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 4

I have error as
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE

The whole info during training with multiple Gpus is attached.
error.txt (163.8 KB)

  1. From Open Images Pre-trained Instance Segmentation — TAO Toolkit 3.22.05 documentation, resnet50.hdf5 is a pre-trained weight file trained on a subset of the Google OpenImages dataset.
    For the ResNet series, the block IDs valid for freezing are any subset of [0, 1, 2, 3] (inclusive).
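As a hedged sketch, the relevant portion of the train spec might look like the fragment below (field names per the MaskRCNN docs; the values are an example for freezing the whole backbone, not a recommendation):

```
maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1,2,3]"  # freeze all four ResNet blocks; only the heads train
    ...
}
```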

There is no baseline for the Mapillary dataset, so we cannot conclude whether the accuracy is good or not.

  1. Yes, total_steps = total_images * total_epochs / batch_size / nGPUs
  2. Please use a new result folder when you type the training command.

Yes, I always use a new folder for training. But I get that error when training with multiple GPUs.
The following errors happened when training with 4 GPUs.
NCCL WARN Error while creating shared memory segment nccl-shm-recv-372f958f789c7514-0-3-0

adae3ba4da04:168:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
adae3ba4da04:168:694 [0] NCCL INFO include/shm.h:41 -> 2

adae3ba4da04:168:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bac3e042d6e8f0a2-0-3-2 (size 9637888)
adae3ba4da04:168:694 [0] NCCL INFO transport/shm.cc:100 -> 2
adae3ba4da04:168:694 [0] NCCL INFO transport.cc:34 -> 2
adae3ba4da04:168:694 [0] NCCL INFO transport.cc:84 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:753 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:867 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:903 -> 2
adae3ba4da04:168:694 [0] NCCL INFO init.cc:916 -> 2

adae3ba4da04:166:702 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
adae3ba4da04:166:702 [0] NCCL INFO include/shm.h:41 -> 2

adae3ba4da04:166:702 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-372f958f789c7514-0-3-0 (size 9637888)
adae3ba4da04:166:702 [0] NCCL INFO transport/shm.cc:100 -> 2
adae3ba4da04:166:702 [0] NCCL INFO transport.cc:34 -> 2
adae3ba4da04:166:702 [0] NCCL INFO transport.cc:84 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:742 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:867 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:903 -> 2
adae3ba4da04:166:702 [0] NCCL INFO init.cc:916 -> 2

For the above log, please refer to the below topic and its solution.

Yes, as he mentioned, I also changed to

"DockerOptions": {
    "shm_size": "16G",
    "ulimits": {
        "memlock": -1,
        "stack": 67108864
    }
}

Then it works.

Now I only need to solve the accuracy issue.
When I check the test images, I see segmentation: sky (not included in training) and road (included in training) are segmented, but cars have no bounding boxes.
There are also bounding boxes labelled N/A, which I don't understand.
Is there any misunderstanding on my part about the tested image shown attached?

How did you get the above annotated image? Can you share the command?

This is the command used.

!tao mask_rcnn inference -i $USER_EXPERIMENT_DIR/testimages \
                         -o $USER_EXPERIMENT_DIR/maskrcnn_annotated_images \
                         -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                         -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.step-90000.tlt \
                         -l $SPECS_DIR/instance_label.txt \
                         -t 0.5 \
                         -k $KEY \
                         --include_mask

Can you attach the full training log? More, how about running inference against some training images?

Thanks for the reply.
My latest training has a high AP50, comparable to the sample COCO training shown by TLT. AP50 = 0.33 is quite high.

AP: 0.213585272
AP50: 0.332958937
AP75: 0.210112900

Training log file is attached.
log.txt (6.7 MB)
The total loss reached 0.916, which is quite good.
My training spec file is also attached.
maskrcnn_train_resnet50.txt (2.1 KB)

The command for visualizing the test images is the same one shared before.

More test images are attached.




My objects include car, people, road, etc., as shown in instance_label.txt, but car and people are never detected.
instance_label.txt (173 Bytes)

What could be wrong?

Could you attach your /workspace/tao-experiments/data/mask_rcnn/instances_shape_validation2020.json ?

I'm afraid you need to set the correct num_classes: 125 during training.
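One plausible cause of the "N/A" boxes (an assumption on my part, not TAO's documented behavior): if num_classes in the spec exceeds the number of entries in instance_label.txt, predicted class indices past the end of the label list can only be displayed with a placeholder. A hypothetical illustration:

```python
# Hypothetical illustration, not TAO's actual code: a label file with
# fewer entries than num_classes leaves some class ids unnamed.
labels = ["background", "car", "person", "road"]  # 4 entries from a label file
num_classes = 6  # spec says 6 -> ids 4 and 5 have no label to show

def label_for(class_id, labels):
    # Fall back to a placeholder when the id is outside the label list.
    return labels[class_id] if 0 <= class_id < len(labels) else "N/A"

print(label_for(1, labels))  # -> car
print(label_for(5, labels))  # -> N/A
```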

But I am training for 19 classes only: 18 + 1 background.
My instances_shape_validation2020.json is attached.
instances_shape_validation2020.json (19.8 MB)

Let me retrain with 125 classes.
Here, the original post said 124 classes.

Then he also trained with 124 classes.

But in my python code for the Mapillary-to-COCO conversion, I have only 18 classes, as shown in main.py
main.py (9.6 KB)
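A quick consistency check, assuming the converter writes a standard COCO layout, is to compare the category count in the generated json against num_classes - 1 (background being the extra class). A minimal sketch with inline stand-in data (a real run would load the generated annotation file):

```python
import json

# Hypothetical stand-in for the converter's output json.
coco = json.loads('{"categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "person"}]}')

n_categories = len(coco["categories"])
print(n_categories)      # foreground classes present in the annotations
print(n_categories + 1)  # what num_classes in the train spec should be
```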

The original is 37 + 1 classes.

I changed to 125 classes. The training AP50 is quite high, and the loss is quite low during training.

AP: 0.194935068
AP50: 0.320598215
AP75: 0.191744179

The loss is 0.873.
But inference still gives the same result.
I don't understand: sky is not a class in the list, but sky is detected with high confidence.

The following are log file and spec file.
log.txt (6.7 MB)
maskrcnn_train_resnet50.txt (2.1 KB)

The test images are still the same.
I don't know what mistake was made.


I suspect I confused v1.2 and v2.0. I am using config_v1.2.json, but the dataset is v2.0, so the colors and classes don't match. Let me check first. Thanks
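One way to confirm the version mismatch is to diff the label names between the two config files. The dicts below are hypothetical minimal stand-ins for the real config_v1.2.json / config_v2.0.json (a real run would `json.load` both files, assuming they carry a "labels" list with "name" entries):

```python
# Hypothetical minimal configs illustrating a label-name diff.
v12 = {"labels": [{"name": "sky"}, {"name": "road"}, {"name": "car"}]}
v20 = {"labels": [{"name": "sky"}, {"name": "road"}, {"name": "vehicle--car"}]}

def names(cfg):
    return {label["name"] for label in cfg["labels"]}

# Symmetric difference: entries present in one version but not the other.
print(sorted(names(v12) ^ names(v20)))  # -> ['car', 'vehicle--car']
```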

More, from the log, the l2 loss does not decrease.
Please set a lower l2 weight decay and retry:
l2_weight_decay: 0.00001